
Pathosystems Biology: Computational Prediction and Analysis of Host-Pathogen Protein Interaction Networks

Matthew David Dyer

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Genetics, Bioinformatics, and Computational Biology

Bruno W. S. Sobral, Co-chair

T. M. Murali, Co-chair

Brett M. Tyler

Joao C. Setubal

June 26, 2008

Blacksburg, Virginia

Keywords: Pathosystems Biology, Host-Pathogen Interactions, Protein Interaction Networks

Copyright 2008, Matthew D. Dyer

Pathosystems Biology: Computational Prediction and Analysis of Host-Pathogen Protein Interaction Networks

Matthew D. Dyer

(ABSTRACT)

An important aspect of systems biology is the elucidation of the protein-protein interactions (PPIs) that control important biological processes within a cell and between organisms. In particular, at the cellular and molecular level, interactions between a pathogen and its host play a vital role in initiating infection and in successful pathogenesis. Despite recent successes in advancing the systems biology of model organisms to understand complex diseases, the analysis of infectious diseases at the systems level has not received as much attention. Since pathogen-related disease is responsible for millions of deaths and billions of dollars in damage to crops and livestock, understanding the mechanisms employed by pathogens to infect their hosts is critical to the development of new and effective therapeutic strategies. The research presented here is one of the first computational approaches to studying host-pathogen PPI networks. This dissertation has two main aims. First, we discuss analytical tools for studying host-pathogen networks to identify common pathways perturbed and manipulated by pathogens. We present the first global comparison of the host-pathogen PPI networks of 190 different pathogens and their interactions with human proteins. We also present the construction and analysis of the human-bacterial PPI networks of three highly infectious pathogens: Bacillus anthracis, Francisella tularensis, and Yersinia pestis. The second aim of the research presented here is the development of predictive models for identifying PPIs between host and pathogen proteins. We present two methods: (i) a domain-based approach that uses the frequency of domain pairs in intra-species PPIs, and (ii) a supervised machine learning method that is trained on known inter-species PPIs. The techniques developed in this dissertation, along with the informative datasets presented, will serve as a foundation for the field of computational pathosystems biology.

This work was supported by Department of Defense grant #DAAD 13-02-C-0018 and National Institute of Allergy and Infectious Diseases Grant HHSN26620040035C to B. Sobral, PI.

Acknowledgments

First and foremost, this dissertation owes a lot to my wife, Rose Dyer, and son, Dimitry. I thank them for their understanding and patience after endless answers of “just five more minutes”, for listening patiently as I tried to explain my ideas, and most of all for being selfless and supporting me in my endeavors of obtaining a PhD. Without them this would not have been possible. I am also very grateful to my parents for their constant support and encouragement in helping me to realize that I could achieve whatever I wanted as long as I worked hard enough for it.

I have been extremely fortunate to have been surrounded by great teachers and mentors who have fueled my interest in science. During my undergraduate studies at Brigham Young University I had the pleasure of working closely with Keith Crandall and David McClellan, where I was first introduced to the field of Bioinformatics and how to do research. I thank them for taking the time to get me involved in various projects.

My interest in host-pathogen systems began with an internship at Lawrence Livermore National Laboratory (LLNL) under the direction of Tom Slezak, where I was able to work on Bioinformatic approaches for designing pathogen diagnostics. I am grateful to Tom and his team (Jason Smith, Clint Torres, Beth Vitalis, Shea Gardner, and Marisa Lam) for the trust they placed in me and for allowing me to work on so many great projects. These experiences broadened my skills and expanded my interests in science. I also wish to thank Tom for his willingness to listen and give advice whenever I needed it.

I am grateful to both of my advisers, Bruno Sobral and T. M. Murali, who have helped me to grow tremendously as an independent scientist. I thank Bruno for allowing me the freedom to pursue my interests and for providing endless opportunities to present my research around the world and meet so many wonderful people. During my first semester at Virginia Tech I had the pleasure of taking a “Systems Biology” class taught by T. M. Murali. The class project for this course is what sparked my interest in the field of protein interaction networks. I also wish to thank Murali for our endless conversations, his willingness and patience in explaining concepts to me numerous times, and the late nights and long hours he put in to meet deadlines.

Finally, I would like to thank the other members of my committee, Brett Tyler and Joao Setubal, whose probing questions and suggestions greatly improved my research.


The research presented in this dissertation has been conducted with many colleagues. I thank them for allowing me to include these results here. I am grateful to Donna Shattuck, Chris Neff, and Max Dufford at Myriad Genetics for generating the experimental data and providing the protocols described in Chapter 4. I also wish to thank Corban Rivera for the GraphHopper code used to analyze the data in Chapter 4.

I also wish to note that the research presented in Chapter 3 was published by Dyer et al., 2008 [64] and the research presented in Chapter 6 was published by Dyer et al., 2007 [63]. We are currently preparing manuscripts for submission of the research presented in Chapters 4 and 7.

I gratefully acknowledge the financial support provided to me by the Department of Defense grant #DAAD 13-02-C-0018 and National Institute of Allergy and Infectious Diseases Grant HHSN26620040035C.


Contents

1 Introduction

2 A Review of Analysis of Protein Interaction Networks

2.1 Network Structure

2.1.1 Notation

2.1.2 Node degree and hubs

2.1.3 Node centrality and bottlenecks

2.2 Conserved Protein Interaction Modules

2.2.1 PathBLAST

2.2.2 Graemlin

2.2.3 Match-and-Split

3 The Landscape of Human Proteins Interacting with Pathogens

3.1 Methods

3.1.1 Notation

3.1.2 Analysis of degree in the human PPI network

3.1.3 Analysis of betweenness centrality in the human PPI network

3.1.4 Gene set enrichment analysis

3.1.5 Correlation of GSEA degree and centrality

3.1.6 Functional enrichment

3.1.7 Biclustering of enriched functions

3.1.8 Datasets used


3.2 Results and Discussion

3.2.1 Pathogens interact with human hubs and bottlenecks

3.2.2 Functions enriched in proteins interacting with pathogens

3.2.3 The network of proteins interacting with multiple pathogens

3.3 Conclusions

4 The Human-Pathogen PPI Networks of Three Bacterial Pathogens: Bacillus anthracis, Francisella tularensis, and Yersinia pestis

4.1 Methods

4.1.1 Experimental methods

4.1.2 Computational methods

4.1.3 The GraphHopper Algorithm

4.1.4 Computing basis CPIMs.

4.1.5 Expanding a basis CPIM.

4.1.6 Assessing the statistical significance of a CPIM.

4.1.7 CPIM Functional Enrichment

4.1.8 Merging CPIMs

4.1.9 Datasets used


4.2.1 Bacterial pathogens interact with human hubs and bottlenecks

4.2.2 Human proteins interacting with multiple pathogens

4.2.3 Conserved protein interaction modules

4.3 Conclusions

5 A Review on Prediction of Intra-species PPIs

5.1 Literature Review

5.1.1 Rosetta stone

5.1.2 Gene neighborhood

5.1.3 Phylogenetic profiles and co-conservation


5.1.4 Domain-domain profiles

5.1.5 Protein sequence

5.1.6 Data integration

5.1.7 Graph structure

5.2 Application to Host-Pathogen Systems

6 Predicting Host-Pathogen PPIs Using Domain Profiles

6.1 Methods

6.1.1 Predicting host-pathogen protein-protein interactions using domain profiles

6.1.2 Prediction proximity

6.1.3 Prediction correlated gene expression

6.1.4 Weighted functional enrichment

6.1.5 Datasets used


6.2.1 Triplet proximity

6.2.2 Triplet correlation of gene expression

6.2.3 Functionally enriched subnetworks

6.2.4 Using different protein features to build training set

6.3 Conclusions

7 Supervised Learning and Prediction of Host-Pathogen PPIs

7.1 Methods

7.1.1 Support Vector Machines

7.1.2 PPI features

7.1.3 Evaluation of performance

7.1.4 Datasets used

7.2 Results

7.2.1 Predictive feature sets


7.2.2 Feature importance

7.2.3 Literature-based validation of predicted PPIs

7.2.4 Transferring host-pathogen PPIs

7.3 Conclusions

8 Outlooks and Perspectives

A Supplementary Figures

B Supplementary Tables

Bibliography


List of Figures

1.1 Systems biology focuses on the study of various molecular components in order to gain a systems-level understanding of how an organism functions as it interacts with its environment. This image was taken with permission from http://www.scq.ubc.ca/what-is-bioinformatics/.

2.1 Example of a graph G(V,E) consisting of a set of nodes V and a set of edges E.

2.2 Example pathway alignment and merged representation. (a) Vertical solid lines indicate direct protein-protein interactions within a single pathway, and horizontal dotted lines link proteins with significant sequence similarity (BLAST E-value ≤ 0.01). An interaction in one pathway may skip over a protein in the other (e.g., protein C), introducing a “gap”. Proteins at a particular position that are dissimilar in sequence (E-value > 0.01, e.g., proteins E and g) introduce a “mismatch.” The same protein pair may not occur more than once per pathway, and neither gaps nor mismatches may occur consecutively. (b) Paths are combined into a global alignment graph in which each node represents a homologous protein pair and links represent protein interaction relationships of three types: direct interaction, gap (one interaction is indirect), and mismatch (both interactions are indirect). Image and caption taken with permission from Kelley et al. [124]. Copyright 2003 National Academy of Sciences, U.S.A.


2.3 Outline of Graemlin algorithm. (A) Shown here are four networks, together with their phylogenetic relationship. Graemlin will multiply align all four. (B) Graemlin first performs a pairwise alignment of the two closest species. (C) Graemlin adds to the alignment the pair of nodes or single node from the frontier (nodes not in the alignment graph) that maximally increases the alignment score; the extension phase stops when no nodes from the frontier can augment the alignment score. (D) Graemlin transforms the resulting alignment, together with the unaligned nodes from the original networks, into three generalized networks for use in the next phase of progressive alignment. (E) Graemlin will perform three pairwise alignments, one for each of the newly created generalized networks. Image and caption taken with permission from Flannick et al. [70].

3.1 The network of interactions between human proteins interacting with at least two viral pathogen groups. The size and color of a protein denote the number of pathogen groups that interact with it: light blue is two, dark blue is three, green is four, yellow is five, orange is six, and red is seven. We used the Cerebral plugin [15] for Cytoscape [201] to render this image.

3.2 The network of interactions between human proteins interacting with at least one bacterial pathogen group. The size and color of a protein denote the number of pathogen groups that interact with it: purple is one, light blue is two, dark blue is three, and green is four. We used the Cerebral plugin [15] for Cytoscape [201] to render this image.

3.3 Cumulative log-log distributions of (a) node degrees and (b) centralities for four subsets of nodes in the human PPI network: (i) red pluses are the set of all proteins in the network not interacting with a single pathogen in our dataset; (ii) green squares correspond to the viral set; (iii) blue crosses are for the bacterial set; and (iv) magenta squares are for the multiviral set. Numbers in parentheses represent the number of proteins in each set. The fraction of proteins at a particular value of degree or centrality is the number of proteins having that value or greater divided by the number of proteins in the set.

3.4 Enriched network of human proteins annotated with “cell cycle”. The subset of proteins labeled as “Non-specific” are those not annotated with any function more specific than “cell cycle” in GO. If a protein participates in multiple phases, then it appears in each phase. An edge connecting two proteins denotes a known interaction in the human PPI network. Human proteins highlighted in red are those known to be involved in the induction of apoptosis.


3.5 Enriched network of human proteins annotated with “nuclear transport” (blue), “nuclear membrane part” (green), “protein import” (orange), and “nuclear pore” (red). An edge connecting two proteins denotes a known interaction in the human PPI network.

3.6 Enriched network of human proteins annotated with “immune system process” (red), “response to wounding” (orange), “immune response” (green), and “I-kappaB kinase/NF-κB cascade” (blue). The proteins in the black box form a dense network of PPIs; we have left these edges out for clarity. An edge connecting two proteins denotes a known interaction in the human PPI network.

4.1 Overview of the experimental pipeline used to identify and study the human-bacteria protein-protein interactions.

4.2 An illustration of how GraphHopper expands a CPIM in iteration k. (a) A CPIM at the end of iteration k − 1. (b) In iteration k, GraphHopper keeps the network in the left side of the CPIM fixed and expands the network in the right side of the CPIM. The two nodes marked by arrows belong to the set P computed in Step (i) of the algorithm. The node v′ found in Step (iii) is the lower of these two nodes. In Steps (iv) and (v), GraphHopper adds the thick magenta interactions and orthology edges to the red network in the CPIM. (c) The CPIM at the end of iteration k.

4.3 Venn diagram of human proteins interacting with pathogen proteins for each of the pathogens.

4.4 Cumulative log-log plots of (a) node degrees and (b) centralities for six subsets of proteins in the human PPI network: the red curve is for the set of all proteins in the human PPI network; the green curve is for the set interacting with B. anthracis; the dark blue curve is for the set interacting with F. tularensis; the purple curve is for the set interacting with Y. pestis; the light blue curve is for the set interacting with at least two pathogens; and the orange line is for the set interacting with all three pathogens. For each set, the fraction of proteins at a particular value of degree or centrality is the number of proteins having that value or greater divided by the number of proteins in the set. Counts in parentheses represent the number of proteins in each set.

4.5 Identified interactions of human proteins involved in apoptosis. We divide the human proteins into sub-sets based on whether they induce or prevent apoptosis, or whether they regulate apoptosis. Proteins in the group labeled “Non-specific” do not have an annotation more specific than “Apoptosis” in the Gene Ontology [10]. For clarity this image shows only interactions involving virulence factors and “uncharacterized” proteins.


4.6 Conserved modules of human-pathogen PPIs involved in (a-c) antigen binding and processing and (d-f) immune response pathways. For clarity these images show only interactions involving virulence factors and “uncharacterized” proteins.

5.1 Subset of previously published methods for predicting intra-species protein-protein interactions.

5.2 Histidine biosynthesis and glutamine amidotransferase domains of the enzyme imidazole glycerophosphate synthase from Thermotoga maritima and from Saccharomyces cerevisiae. Dashed arrows show the sequence similarities between the different subunits and the solid arrow shows the structural similarity. Image and caption taken with permission from Aloy et al. [5].

5.3 Five examples of pairs of E. coli proteins predicted to interact by the Rosetta stone method proposed by Marcotte et al. [153]. Each protein is shown schematically with boxes representing domains. For each example, a triplet of proteins is pictured: The second and third proteins are predicted to interact because their homologs are fused in the first protein. Image and caption taken with permission from Marcotte et al. [153].

5.4 Overview of phylogenetic profile method. Colored boxes correspond to orthologous proteins found across an input set of genomes. A value of 1 in the matrix indicates the presence of a protein within the respective genome. The profiles indicate that proteins p2 and p7 (respectively, p3 and p6) are functionally linked. Edges denote two profiles that differ by at most one position.

5.5 Overview of Pazos and Valencia, 2001 [175]. The initial multiple sequence alignments were reduced, leaving only sequences of the same species, and consequently the tree constructed from the reduced alignments would have the same number of leaves and the same species in the leaves. From the reduced alignments, the matrices containing the average homology for every possible pair of proteins were constructed. Such matrices contained the structure of the phylogenetic tree. Finally, the similarity between the datasets of the two matrices and implicitly the similarity between the two trees were evaluated with a linear correlation coefficient.

5.6 Cross-Species Clustered Co-Conserved method proposed by Karimpour-Fard et al. [121]. After obtaining precomputed PPI networks, all proteins are mapped to a single organism (E. coli). Next, combined, common, and unique networks are constructed and analyzed.

5.7 For sequence signatures and domain-domain profiles a set of known interactions is used and for each interaction counts are made for observed domain pairs.


5.8 Interolog mapping involves taking a set of known interactions from a group of source organisms, identifying orthologs in a target organism, and mapping interactions onto the target.

5.9 Schematic diagram for constructing the vector space (V, F) of a protein sequence. V is the vector space of the sequence features; each feature (v_i) represents a triad composed of three consecutive amino acids; F is the frequency vector corresponding to V, and the value of the ith dimension of F (f_i) is the frequency that the v_i triad appeared in the protein sequence. Image and caption taken with permission from Shen et al. [206]. Copyright 2007 National Academy of Sciences, U.S.A.

5.10 Illustration of the model used to assign a probability P(D|a) to the joint multiple sequence alignment D of two protein families given an assignment a of interaction partners between them. Sequences from the same genomes have the same color and horizontally aligned sequences are assumed to interact. The probabilities of pairs of alignment columns (ij) depend on the number of times n^{ij}_{αβ} that amino acids (αβ) occur in the corresponding columns. A dependence tree T and the corresponding factorization of the probability P(D|a, T) of the entire alignment given the assignment and dependence tree is illustrated at the bottom of the figure. Image and caption taken with permission from Burger and van Nimwegen [35].

5.11 Combination of datasets into probabilistic interactomes. Image and caption taken with permission from Jansen et al. [113].

5.12 A defective clique k of size n that contains one missing edge (p, q) is the same as the union of two complete cliques of size n-1, p and q.

5.13 Example of completing defective cliques in high-throughput experimental datasets. RRP43, RRP4 and RRP42 are three proteins in the yeast exosome involved in RNA processing. Large-scale datasets have these three proteins as disconnected, yielding three separate cliques from this one complex. Image and caption taken with permission from Yu et al. [245].

6.1 Two enriched pairs of functions (l, m) and (f, g) where l is an ancestor of f and m is an ancestor of g. Dashed lines denote paths between functions in GO defined by parent-child relationships between them.

6.2 Distributions of predictions made using different cutoffs for classifying uncommon domains. Numbers in parentheses represent the number of predictions made using that cutoff.


6.3 A layout of the predicted human-Plasmodium PPI network. Blue circles are human proteins. Red diamonds are Plasmodium proteins. Solid grey edges are predicted PPIs.

6.4 Distributions of H-H-P and H-P-P intra-species distances for the triplet proximity analysis. Numbers in parentheses are the number of triplets.

6.5 Distributions of H-H-P and H-P-P Spearman’s correlations for the triplet co-expression analysis. Numbers in parentheses are the number of triplets.

6.6 Life cycle of the parasite Plasmodium falciparum.

6.7 Number of predicted interactions resulting from the use of different filters during the training and testing steps. ALL is the set of all proteins, G is the set of proteins annotated with Gene Ontology functions of interest, M is the set of proteins containing transmembrane-domains, P is the set of proteins containing a signal peptide, and S is the set of proteins predicted to be secreted, but do not contain a signal peptide. The first nine groups were trained using all proteins, followed by the use of the filters, whereas the next eight groups were trained and tested on sets of filtered proteins. The dark-blue bars are when we do not use the BLAST-Filter and the light-blue groups are when we use the BLAST-Filter.

6.8 Number of enriched functions identified amongst predicted interactions resulting from the use of different filters during the training and testing steps. ALL is the set of all proteins, G is the set of proteins annotated with Gene Ontology functions of interest, M is the set of proteins containing transmembrane-domains, P is the set of proteins containing a signal peptide, and S is the set of proteins predicted to be secreted, but do not contain a signal peptide. The first nine groups were trained using all proteins, followed by the use of the filters, whereas the next eight groups were trained and tested on sets of filtered proteins. The dark-blue bars are when we do not use the BLAST-Filter and the light-blue groups are when we use the BLAST-Filter.

6.9 Overlap of the predicted interactions resulting from the use of different filters during the training and testing steps. ALL is the set of all proteins, G is the set of proteins annotated with Gene Ontology functions of interest, M is the set of proteins containing transmembrane-domains, P is the set of proteins containing a signal peptide, and S is the set of proteins predicted to be secreted, but do not contain a signal peptide. The first nine groups were trained using all proteins, followed by the use of the filters, whereas the next eight groups were trained and tested on sets of filtered proteins. Each cell is the Jaccard coefficient of the predicted interactions between the corresponding row and column datasets. The Jaccard values range from 0 (red) to 0.5 (black) to 1 (green).


6.10 Overlap of the enriched functions amongst the predicted interactions resulting from the use of different filters during the training and testing steps. ALL is the set of all proteins, G is the set of proteins annotated with Gene Ontology functions of interest, M is the set of proteins containing transmembrane-domains, P is the set of proteins containing a signal peptide, and S is the set of proteins predicted to be secreted, but do not contain a signal peptide. The first nine groups were trained using all proteins, followed by the use of the filters, whereas the next eight groups were trained and tested on sets of filtered proteins. Each cell is the Jaccard coefficient of the predicted interactions between the corresponding row and column datasets. The Jaccard values range from 0 (red) to 0.5 (black) to 1 (green).

7.1 Precision/recall curves using different k-mer sizes for different TP:TN ratios.

7.2 Precision/recall curves using different feature combinations for different TP:TN ratios.

7.3 Image taken with permission from Brass et al. [33] and adapted to show: 1) HDFs for which we make no predictions (red), 2) HDFs for which we make predictions using a 1:1 or 1:10 ratio in training (blue), and 3) HDFs for which we make predictions using a 1:25 ratio or higher during training (green).

7.4 Precision/Recall of a DKN model trained using human-HIV interactions for different TP:TN ratios and used to predict in other viral systems.

8.1 Number of publications in PubMed for studies on the protein interaction network of the model organism Saccharomyces cerevisiae. The PubMed query was “Saccharomyces cerevisiae AND protein interaction”, restricted by the publication year of interest.

A.1 Log-log scatter-plot of each protein contained within the whole human PPI networks used in Chapter 3, which contains 11,463 proteins. The x-axis is the degree and the y-axis is the centrality of a protein within its respective network.

A.2 Log-log scatter-plot of each protein contained within the high-throughput human PPI networks used in Chapter 3, which contains 4,986 proteins. The x-axis is the degree and the y-axis is the centrality of a protein within its respective network.

A.3 Log-log scatter-plot of each protein contained within the manually curated human PPI networks used in Chapter 3, which contains 8,704 proteins. The x-axis is the degree and the y-axis is the centrality of a protein within its respective network.


A.4 Overlap of the predicted interactions resulting from the use of different filters during the training and testing steps and the application of a post-prediction BLAST-filter (see Chapter 6). ALL is the set of all proteins, G is the set of proteins annotated with Gene Ontology functions of interest, M is the set of proteins containing transmembrane-domains, P is the set of proteins containing a signal peptide, and S is the set of proteins predicted to be secreted, but do not contain a signal peptide. The first nine groups were trained using all proteins, followed by the use of the filters, whereas the next eight groups were trained and tested on sets of filtered proteins. Each cell is the Jaccard coefficient of the predicted interactions between the corresponding row and column datasets. The Jaccard values range from 0 (red) to 0.5 (black) to 1 (green).

A.5 Overlap of the enriched functions amongst the predicted interactions resulting from the use of different filters during the training and testing steps and the application of a post-prediction BLAST-filter (see Chapter 6). ALL is the set of all proteins, G is the set of proteins annotated with Gene Ontology functions of interest, M is the set of proteins containing transmembrane-domains, P is the set of proteins containing a signal peptide, and S is the set of proteins predicted to be secreted, but do not contain a signal peptide. The first nine groups were trained using all proteins, followed by the use of the filters, whereas the next eight groups were trained and tested on sets of filtered proteins. Each cell is the Jaccard coefficient of the predicted interactions between the corresponding row and column datasets. The Jaccard values range from 0 (red) to 0.5 (black) to 1 (green).

xvi

List of Tables

3.1 Summary of experimental methods and literature support for the host-pathogen PPIs in our dataset. The experimental group “Not specified” denotes that there was no observation method listed in any database. Note that every interaction in this study must have at least one piece of literature support.

3.2 Overview of human-pathogen PPIs. For each pathogen group we list the total number of PPIs involving pathogen proteins in that group, the number of strains in that group, the number of unique human proteins interacting with that group, and the number of these that interact with at least one other human protein.

3.3 Summary of GSEA results with and without human-HIV interactions for three networks: Whole human PPI network (W), the human PPI network yielded by High-Throughput experiments (HT), and the human PPI network consisting only of Manually Curated PPIs (MC). We report p-values only for the sets of human proteins in Figure 3.3. The “#proteins in group” column displays the total number of human proteins in that group. The “ES” column displays the enrichment score calculated by GSEA.

4.1 Summary of human-pathogen interactions.

4.2 Summary of the number of identified CPIMs for each of the algorithms used in this study.

6.1 Distribution of the number of predicted host–pathogen PPIs for different ranges of probability.

7.1 Summary of the number of predicted interactions for the different TP:TN ratios using the DKN model. The dataset contained 1,028 TP PPIs, of which 74 involve HIV dependency factors.

B.1 Relative occurrences of four types of nodes in each of the three networks used in Chapter 3. The “Fraction” column defines the cutoff at which a protein is considered a hub or a bottleneck. The other columns represent the fraction of hub-bottleneck (H-B), non-hub-bottleneck (NH-B), hub-non-bottleneck (H-NB), and non-hub-non-bottleneck (NH-NB) proteins in the network using that cutoff.

B.2 Summary of GSEA degree results with and without human-HIV PPIs from Chapter 3. The “#proteins in group” column displays the total number of human proteins in that group. The “ES” column displays the enrichment score calculated by GSEA. The column titled “#proteins contributing” displays the number of proteins contributing to the ES score. The column titled “Jaccard’s” lists the Jaccard coefficient between the two sets of proteins contributing to the ES score for degree and for centrality (see Table B.3 for centrality results).

B.3 Summary of GSEA centrality results with and without human-HIV PPIs from Chapter 3. The “#proteins in group” column displays the total number of human proteins in that group. The “ES” column displays the enrichment score calculated by GSEA. The column titled “#proteins contributing” displays the number of proteins contributing to the ES score. The column titled “Jaccard’s” lists the Jaccard coefficient between the two sets of proteins contributing to the ES score for degree and for centrality (see Table B.2 for degree results).

B.4 Summary of GSEA degree results for individual pathogen groups from Chapter 3. The “#proteins in group” column displays the total number of human proteins in that group. The “ES” column displays the enrichment score calculated by GSEA. The column titled “#proteins contributing” displays the number of proteins contributing to the ES score. The column titled “Jaccard’s” lists the Jaccard coefficient between the two sets of proteins contributing to the ES score for degree and for centrality (see Table B.5 for centrality results).

B.5 Summary of GSEA centrality results for individual pathogen groups from Chapter 3. The “#proteins in group” column displays the total number of human proteins in that group. The “ES” column displays the enrichment score calculated by GSEA. The column titled “#proteins contributing” displays the number of proteins contributing to the ES score. The column titled “Jaccard’s” lists the Jaccard coefficient between the two sets of proteins contributing to the ES score for degree and for centrality (see Table B.4 for degree results).

B.6 Summary of GSEA results for protein degree of human proteins for three networks from Chapter 4: (W) the whole human PPI network, (HT) the human PPI network generated by only considering high-throughput experiments, and (C) the human PPI network generated by only considering manually curated PPIs. The “# protein in group” column displays the total number of human proteins targeted. The “ES” column displays the enrichment score calculated by GSEA for degree. The column titled “# proteins contributing” displays the number of proteins contributing to the ES score (see Table B.7 for centrality results).

B.7 Summary of GSEA results for protein betweenness centrality of human proteins for three networks from Chapter 4: (W) the whole human PPI network, (HT) the human PPI network generated by only considering high-throughput experiments, and (C) the human PPI network generated by only considering manually curated PPIs. The “# protein in group” column displays the total number of human proteins targeted. The “ES” column displays the enrichment score calculated by GSEA for centrality. The column titled “# proteins contributing” displays the number of proteins contributing to the ES score (see Table B.6 for degree results).

B.8 Accuracy values for each feature set described in Chapter 7. Unless stated otherwise, the feature K corresponds to four-mers.

B.9 Accuracy, precision, and recall values for each human-viral system using the DKN model trained with human-HIV PPIs described in Chapter 7.

Chapter 1

Introduction

Sophisticated high-throughput biological experiments to interrogate a cell provide a wide range of functional genomic information about cell state. These advances, combined with the public availability of the resulting datasets, herald the era of systems biology. Numerous, diverse, and rich resources exist for studying the systems biology of organisms such as Escherichia coli, Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster. Experimental and computational methods developed for these organisms are now being extended to mammals. These advances are helping to shed light on how cells respond to various conditions as an organism interacts with its environment.

An important aspect of systems biology is to gain an understanding of how different molecules, e.g., metabolites, DNA, RNA, and proteins, come together and interact with each other to form the machinery that controls critical cellular processes. These interactions are also responsible for executing processes key to cellular survival when the organism as a whole interacts with its environment (see Figure 1.1). An area of systems biology that has received considerable attention is the elucidation and study of protein-protein interactions (PPIs). The term “PPI” refers to the association or physical interaction of two or more proteins. These interactions can take many forms, ranging from a direct interaction between two proteins to two proteins that are part of the same complex. A PPI may even refer to proteins that belong to the same biochemical pathway. Identified PPIs have been pieced together to form large Protein Interaction Networks (PINs). PINs have been an important resource for comparative genomic studies [62, 124, 204], predicting PPIs [154, 244], determining the function of unannotated proteins [120, 205], and identifying molecular machines in Caenorhabditis elegans [87]. These techniques have been primarily applied to intra-species networks of model organisms, e.g., Escherichia coli and Saccharomyces cerevisiae. The availability of experimental biological data has been the driving force for studying these model organisms.

In nature, an organism rarely acts as an autonomous entity. Rather, it interacts with other organisms in its environment. Such “cross-species” interactions are the foundations of functioning ecosystems. At the cellular and molecular level, interactions between proteins from two different species play an important role in establishing such communications. Such cross-species interactions can be beneficial for all parties involved. For example, the α-proteobacterium Sinorhizobium meliloti has formed a symbiotic relationship with its host, Medicago truncatula, in which S. meliloti fixes nitrogen for M. truncatula in exchange for protection from the environment. However, not all cross-species interactions are symbiotic; in fact, many are pathogenic. Examples of such pathogenic systems include those of many infectious diseases, e.g., HIV, influenza, and Plasmodium falciparum. Pathogen-related disease results in millions of deaths each year. P. falciparum alone is responsible for an estimated 1.5 to 2.7 million deaths yearly. Host-pathogen systems are not restricted to human diseases. Every year, billions of dollars' worth of damage to crops and livestock is due to pathogenic organisms. Millions of dollars are spent annually to better understand how pathogens infect their hosts and to identify potential targets for therapeutics.

Figure 1.1: Systems biology focuses on the study of various molecular components in order to gain a systems-level understanding of how an organism functions as it interacts with its environment. This image was taken with permission from http://www.scq.ubc.ca/what-is-bioinformatics/.

PPIs play an important role in initiating infection and a successful pathogenesis. Thus, the computational study of Host-Pathogen PINs (HP-PINs) holds the promise of helping biologists to better understand mechanisms of pathogenesis and of providing new technologies for the rapid identification of therapeutic targets. For example, comparative analysis of multiple HP-PINs may allow us to identify conserved mechanisms of pathogen cellular invasion and provide potential targets for therapeutics. However, the study of HP-PINs is made difficult by three main factors:

1. Host-pathogen PPI data is scarce. With the exception of the human-HIV system, the number of experimentally validated host-pathogen PPIs is extremely small. This is because high-throughput screens have primarily been used to detect intra-species PPIs [77, 81, 99, 108, 109, 142, 194, 214, 232]. With the exception of recent high-throughput screens [36] and those presented in Chapter 4, host-pathogen PPIs have traditionally been derived from small-scale experiments, which tend to focus on a particular pathogen protein or pathway of interest.

2. Pathogens can interact with their hosts in multiple ways. The location and strategy of host-pathogen PPIs can vary greatly depending on the type of pathogen. For example, some bacterial organisms are extra-cellular and interact only with the membrane of the host cell. Others are intra-cellular, i.e., they enter and replicate within the host cell and may interact with many host proteins found within the host cell. Viral systems have their own unique properties. Since viruses lack the machinery necessary to transcribe and translate their own genetic material, they usually subvert host machinery and manipulate host cellular processes. Such variation typically makes it impossible to apply a single experimental technology across all host-pathogen systems for identifying biologically relevant PPIs.

3. Phenotypic outcomes vary amongst related pathogens. The phenotypic effects of a pathogen on a host and the subsequent immune system response can vary greatly. For example, influenza infection is much more severe in infants and the elderly than in middle-aged people. Evolutionary pressure on pathogens tends to favor the reproductive rate and level of virulence that maximizes pathogen fitness. However, given that different hosts offer varying selective environments to pathogens, one level of virulence may not be appropriate for all host types [178]. The environment in which the pathogen propagates may be relatively unpredictable, so a pathogen may have the ability to “feel out” the environment, using biochemical cues associated with immune responses to evaluate the host environment and decide on the proper strategy for infection [226]. If such variation is a result of the underlying PPIs, then performing a single screen of a host-pathogen system may not be sufficient to capture all of the important PPIs.

These difficulties could be avoided by identifying and testing the significance of all PPIs across all host-pathogen systems. However, given the vast number of such systems and the endless variation at the allelic level between strains, such an approach is not feasible. Therefore, we argue that an important application of the systems biology of host-pathogen systems, hereafter referred to as pathosystems biology, is the development of computational models that can identify important strategies used by pathogens during infection and efficiently predict HP-PINs. The research presented here provides the first such reported methods and lays the foundation for the field of computational pathosystems biology with respect to the study of HP-PINs. While we focus on H. sapiens-pathogen PINs, the same approaches can be generalized to analyze any host-symbiont or host-pathogen system.

Organization

The research has been organized into two major aims: (i) methods for studying and analyzing known experimentally verified HP-PINs and (ii) methods for predicting HP-PINs.

Chapters 2 through 4 focus on the first aim. We begin in Chapter 2 with a background of methods that have been used to study intra-species PINs and to identify conserved patterns of PPIs across multiple PINs. In Chapter 3, we provide a global analysis of H. sapiens-pathogen PINs for 190 different pathogens. In Chapter 4, we present the first experimentally derived large-scale H. sapiens–pathogen PINs for three bacterial pathogens: Bacillus anthracis, Francisella tularensis, and Yersinia pestis. We use an algorithm for identifying conserved modules across these HP-PINs. These comparative approaches allow us to answer important questions such as: What are the properties of human proteins that interact with pathogens? Do pathogens interact with certain functional classes of human proteins? Which infection mechanisms and pathways are commonly triggered by multiple pathogens?

Chapters 5 through 7 focus on the second aim of developing predictive methods for identifying physical interactions between human and pathogen proteins. We begin in Chapter 5 with a review of methods that have been previously used to predict PINs in a single organism. We discuss some of the difficulties of applying these methods to host-pathogen systems. In Chapter 6, we present an extension of the domain-profile method (previously used to predict intra-species PPIs [211]) to predict PPIs between H. sapiens and P. falciparum proteins. This approach is useful when the number of known host-pathogen PPIs is very limited, but there are a large number of intra-species PPIs available for both the host and pathogen systems (or closely related organisms). In Chapter 7, we apply a support vector machine (SVM) machine-learning algorithm to a publicly available dataset of human-HIV interactions and explore a variety of protein features to identify which combination performs best at predicting HP-PINs. This approach is useful when there are a large number of known PPIs for the host-pathogen system of interest. In this chapter we also assess the utility of using a classifier trained on one human-pathogen system to predict PPIs in other human-pathogen systems.

Finally, we conclude with an outlook and some perspectives on the future of computational pathosystems biology in Chapter 8.

Chapter 2

A Review of Analysis of Protein Interaction Networks

Experimental techniques such as mass spectrometry [76, 77, 99, 128] and yeast two-hybrid [194, 214, 232] have been used to identify protein interaction networks (PINs) in a number of model organisms, e.g., Saccharomyces cerevisiae, Escherichia coli, and Drosophila melanogaster. The recent growth in available data for these and other organisms has led to the birth of the field of biological network analysis and comparative interactomics, or comparative analysis of intra-species PINs. These analyses have typically focused on two aspects: first, the structure of the network, and second, identifying subnets, or modules, conserved by evolution across multiple intra-species PINs. We do not discuss other approaches that aim to identify modules of PINs activated under a particular stress by overlaying gene expression data onto the PIN [107, 233]. In this chapter, we focus on computational methods for studying the structure of PINs and identifying conserved modules between PINs.

2.1 Network Structure

A PIN is usually represented as an undirected graph G = (V, E) consisting of a set V of nodes (proteins) and a set E of edges (interactions), each of which connects a pair of nodes in V. Important properties of a node include its degree and centrality.

2.1.1 Notation

The degree of a node v ∈ V is defined as the number of edges incident on v. For example, in Figure 2.1, the degree of the orange node is four and the degree of the red node is seven. Nodes with high degree are often referred to as “hubs”. The betweenness centrality of a node is defined as the fraction of all shortest paths in a graph that travel through that node (see Section 3.1.3 on page 17 for a precise definition). Nodes with high betweenness centrality act as bottlenecks in the graph. An example of such a node is the green node in Figure 2.1. A path π(u, v) in G is a sequence of k ≥ 2 nodes $u = u_1, u_2, u_3, \ldots, u_k = v$ such that for each $1 \le i < k$, $(u_i, u_{i+1})$ is an edge in G. The length of such a path is k; π(u, v) is said to connect u and v. The shortest path $\pi_p(u, v)$ is the path of shortest length amongst all paths that connect u and v.
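The two node properties defined above can be computed directly from an adjacency structure. The following is a minimal sketch, not the implementation used in this dissertation: it assumes an unweighted, undirected graph stored as a dictionary of neighbor sets, and it uses the standard unnormalized betweenness centrality computed with Brandes' algorithm, whose normalization may differ from the precise definition given in Section 3.1.3. The function names and the toy graph are illustrative only.

```python
from collections import deque


def degrees(adj):
    """Degree of each node: the number of edges incident on it."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted graph)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and recording each node's predecessors on those paths.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph every pair (s, t) was visited from both ends.
    return {v: b / 2.0 for v, b in bc.items()}


# Toy network (illustrative only): two triangles joined by the edge C-D.
toy = {
    "A": {"B", "C"}, "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E", "F"},
    "E": {"D", "F"}, "F": {"D", "E"},
}
toy_degree = degrees(toy)
toy_centrality = betweenness(toy)
```

In this toy network, C and D have only moderate degree, yet every shortest path between the two triangles passes through them, so they carry all of the betweenness.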

2.1.2 Node degree and hubs

Researchers have argued that the degree distribution of PPI networks is scale-free and follows a power law, i.e., the fraction of proteins in the network interacting with k other proteins is proportional to $k^{-\gamma}$ for some γ greater than zero, typically between two and three [4, 14]. One feature of scale-free networks is that they are robust in the face of attacks on random nodes. The removal of random subsets of nodes in scale-free networks only gradually increases the average shortest path length [2, 140]. However, the selective removal of even a small number of nodes of high degree can dramatically change the structure of the network by breaking it up into multiple disconnected components [2, 140].
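To make the power-law statement concrete, the sketch below estimates γ from a list of node degrees by fitting a straight line to log P(k) against log k. This is an illustration under stated assumptions rather than a method taken from the cited studies: the least-squares fit on a log-log scale is a simple and fairly crude estimator, the function names are hypothetical, and the degree sequence is synthetic.

```python
import math
from collections import Counter


def degree_distribution(degree_list):
    """Fraction of nodes having each degree k >= 1."""
    counts = Counter(d for d in degree_list if d > 0)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def fit_gamma(dist):
    """Least-squares slope of log P(k) against log k; returns gamma = -slope."""
    xs = [math.log(k) for k in dist]
    ys = [math.log(p) for p in dist.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic degree sequence whose counts are proportional to k^-2.5.
synthetic = [k for k in range(1, 21) for _ in range(round(100000 * k ** -2.5))]
gamma = fit_gamma(degree_distribution(synthetic))
```

On real interaction data the tail of the distribution is sparse and noisy, so an estimate obtained this way should be treated as indicative only.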

Several other interesting biological phenomena have been observed in connection with hubs in biological networks. These observations are mostly based on analysis of the PPI network of Saccharomyces cerevisiae, since this organism has the most comprehensive interaction datasets.

Figure 2.1: Example of a graph G = (V, E) consisting of a set of nodes V and a set of edges E.

Hubs are evolutionarily conserved. It has been suggested that a protein’s rate of evolution depends both on its dispensability to the organism and on the proportion of potential amino acid changes that are compatible with proper protein function [239]. Several studies have shown that hubs in protein interaction networks tend to be evolutionarily conserved, i.e., have lower rates of mutation, when compared to proteins from closely related organisms [19, 73], although this observation has also been argued to be an artifact of an unaddressed bias in high-throughput datasets [19]. Two hypotheses have been put forward to explain this phenomenon. First, if the interacting partners of a hub protein interact with different residues on the hub protein, then a greater proportion of the hub protein is involved in its function. Second, hubs could evolve more slowly, not because a greater proportion of the sequence is required for proper function, but because the entire sequence is subject to stronger selection against slightly deleterious mutations [73].

Hubs are critical for cellular survival. Several studies suggest that hub proteins tend to be essential, i.e., the removal of such proteins results in the death of the cell [90, 94, 114], although this observation has also been debated [49]. However, a recent examination of these issues using a comprehensive literature-curated dataset of well-substantiated PPIs in Saccharomyces cerevisiae shows that while the use of less reliable yeast two-hybrid data alone can reject the possibility that local connectivity correlates with measures of dispensability, a relatively robust correlation is observed in higher quality datasets [19].

Hubs control fine-tuning of cellular processes. When comparing the differential gene expression measurements of proteins across multiple conditions, hub proteins tend to have low levels of change [151]. A recent study indicates that at least some biological responses, for example, an allergic immune response, are mediated by larger expression changes in nodes with low connectivity and smaller changes in hub proteins [151].

Hub proteins have two other unusual properties, namely rapid turnover and a high degree of regulation. Hub proteins have high mRNA decay rates and a large number of phosphorylation sites when compared to non-hub proteins. It has been suggested that these properties may be an adaptation to minimize unwanted activation of pathways that might be mediated by accidental binding to hubs, were they to actively persist longer than required at any given time point [19].

Additionally, hub proteins are highly correlated in expression with their partners, and presumably interact with them at similar times [90, 243].


Hubs allow for redundancy and gene duplications. Redundancy among proteins has also been recently linked with hubs [119]. These studies show that redundant proteins are not randomly distributed within the protein interaction network but are rather strategically allocated to the most highly connected proteins. This design is appealing because it suggests that many of the potentially vulnerable nodes that would otherwise be highly sensitive to mutations are often protected by redundancy [119].

2.1.3 Node centrality and bottlenecks

As is the case with hubs, the removal of nodes with high betweenness centrality can also result in dramatic changes in network structure, e.g., an increase in the number of connected components [82, 98]. Bottleneck proteins have also been observed to have several of the same biologically interesting properties as hub proteins.

Bottlenecks are evolutionarily conserved. Most proteins do not evolve in isolation, but as components of complex genetic networks. Therefore, a protein’s position in a network may indicate how central it is to cellular function and, hence, how constrained it is evolutionarily. A recent study of the Saccharomyces cerevisiae, Drosophila melanogaster, and Caenorhabditis elegans PINs showed that proteins that are central in each of the three networks, regardless of the number of direct interactors, evolve more slowly [89]. This characteristic has also been observed in other studies [118].

Bottlenecks link functional modules. A recent study of the fission yeast, Schizosaccharomyces pombe, shows that bottleneck proteins are responsible for linking functional modules related to the cell cycle. In fact, bottleneck proteins often correspond to known cell cycle checkpoints [38]. The observation that bottleneck proteins tend to connect functionally coherent modules has been used to study the structure of PINs [60].

Bottlenecks are critical for cellular survival. Bottlenecks are, in fact, key connector proteins with surprising functional and dynamic properties. Yu et al. [243] divided all proteins in the S. cerevisiae PPI network into four categories to study the effects of both degree and centrality: 1) non-hub–non-bottlenecks; 2) hub–non-bottlenecks; 3) non-hub–bottlenecks; and 4) hub–bottlenecks. They observed that bottlenecks (both non-hub–bottlenecks and hub–bottlenecks) have a strong tendency to be products of essential genes, whereas hub–non-bottlenecks are surprisingly not essential [243]. Thus, they show that in PPI networks betweenness centrality is a much more significant indicator of essentiality than degree [243]. Such a correlation between essentiality and bottlenecks has also been observed in D. melanogaster and C. elegans PINs [89].
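The four-way division described above can be sketched as a simple ranking rule. The code below is an illustration only and makes several assumptions that are not taken from Yu et al.: hubs and bottlenecks are the proteins in the top `fraction` of the degree and betweenness rankings respectively, ties are broken by input order, and the protein names and values are invented for the example.

```python
def top_fraction(scores, fraction):
    """Nodes ranked in the top `fraction` by score (always at least one node)."""
    n_top = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)  # ties keep input order
    return set(ranked[:n_top])


def classify(degree, centrality, fraction=0.2):
    """Label every node as H-B, H-NB, NH-B, or NH-NB."""
    hubs = top_fraction(degree, fraction)
    bottlenecks = top_fraction(centrality, fraction)
    labels = {}
    for v in degree:
        hub_part = "H" if v in hubs else "NH"
        bottleneck_part = "B" if v in bottlenecks else "NB"
        labels[v] = f"{hub_part}-{bottleneck_part}"
    return labels


# Invented values for five hypothetical proteins.
example_degree = {"p1": 9, "p2": 8, "p3": 2, "p4": 1, "p5": 1}
example_centrality = {"p1": 0.40, "p2": 0.01, "p3": 0.35, "p4": 0.0, "p5": 0.0}
example_labels = classify(example_degree, example_centrality, fraction=0.4)
```

Here p2 is a hub that is not a bottleneck, while p3 is a bottleneck despite its low degree, which is the distinction the four categories are designed to expose.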


Bottleneck proteins show dynamic expression patterns. Besides the strong correlation with essentiality, bottlenecks also correspond to the dynamic components of a PIN. They are significantly less well coexpressed with their neighbors than non-bottlenecks, implying that expression dynamics is wired into the network topology [243].

2.2 Conserved Protein Interaction Modules

With the number of identified PPIs growing for an increasing number of organisms, a natural question is whether or not there are conserved patterns of interactions across multiple PINs. A Conserved Protein Interaction Module (CPIM) is a set of two or more protein subgraphs that share cross-species similarity at the node level (homology of corresponding protein sequences) and at the graph structure level (pattern of interactions). Here we review three algorithms for identifying CPIMs.

2.2.1 PathBLAST

Kelley et al. [124] identified CPIMs by combining the PINs of two species into a single “alignment graph” (see Figure 2.2). A node in the alignment graph represents two orthologous proteins, one from each PPI network, that share at least weak sequence similarity (BLAST E-value ≤ 0.01). An edge in the alignment graph represents an interaction that is conserved in both PPI networks. They added an edge to the alignment graph only if the proteins contributing to the nodes were connected through at most one intermediate protein in the respective source PIN. The weight of an edge represented the likelihood that the corresponding interactions are conserved; this weight depends on the degree of orthology between the proteins and on assessed confidence estimates that the individual PPIs indeed take place in the cell. For a node v in the alignment graph, they defined p(v) as the probability of true homology within the protein pair represented by v. Given an edge e in the alignment graph, they defined q(e) as the probability that the interaction represented by e was real and not a false positive. The probability p(v) was defined using Bayes' rule and a set of known ortholog groups [225]. Likewise, the probability q(e) was computed based on the number of supporting pieces of evidence in the literature. They defined background probabilities, $p_{random}$ and $q_{random}$, that were the expected values of p(v) and q(e) over all the vertices and edges in the global alignment graph. Finally, they defined the score of a conserved path P as

$$S(P) = \sum_{v \in P} \log_{10} \frac{p(v)}{p_{random}} + \sum_{e \in P} \log_{10} \frac{q(e)}{q_{random}} \qquad (2.1)$$

This score is a direct measure of a path’s importance in comparison to all other paths of equal size. Computing the paths with the highest score is an NP-complete problem. Kelley et al. avoided this problem by converting the alignment graph into a directed acyclic graph (DAG). They did this by randomly assigning a direction to each edge in the alignment graph and checking to ensure that no cycles were introduced. They implemented a dynamic-programming algorithm that allowed them to quickly identify paths of high scores in the DAG (see Kelley et al. [124]). Repeating this process multiple times enabled them to find high-scoring paths in the original (undirected) graph with high probability. They applied their method to identify conserved pathways between S. cerevisiae and H. pylori. The same group later described the NetworkBLAST algorithm to find complexes conserved between S. cerevisiae and H. pylori [203] and to find both pathways and complexes conserved between three species, S. cerevisiae, D. melanogaster, and C. elegans [204].

Figure 2.2: Example pathway alignment and merged representation. (a) Vertical solid lines indicate direct protein-protein interactions within a single pathway, and horizontal dotted lines link proteins with significant sequence similarity (BLAST E-value ≤ 0.01). An interaction in one pathway may skip over a protein in the other (e.g., protein C), introducing a “gap”. Proteins at a particular position that are dissimilar in sequence (E-value > 0.01, e.g., proteins E and g) introduce a “mismatch.” The same protein pair may not occur more than once per pathway, and neither gaps nor mismatches may occur consecutively. (b) Paths are combined into a global alignment graph in which each node represents a homologous protein pair and links represent protein interaction relationships of three types: direct interaction, gap (one interaction is indirect), and mismatch (both interactions are indirect). Image and caption taken with permission from Kelley et al. [124]. Copyright 2003 National Academy of Sciences, U.S.A.
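The path score of Equation 2.1 and the randomized search over acyclic orientations can be sketched as follows. This is a simplified illustration and not the PathBLAST implementation: it assumes the alignment graph is already built, stores p as a dictionary over nodes and q as a dictionary over `frozenset` edges, and obtains each acyclic orientation from a random node ordering (every edge points from the earlier node to the later one), which is a substitute for assigning edge directions individually and checking for cycles. All names and probabilities are hypothetical.

```python
import math
import random


def path_score(path, p, q, p_random, q_random):
    """Equation 2.1: summed log10-odds over the nodes and edges of a path."""
    node_term = sum(math.log10(p[v] / p_random) for v in path)
    edge_term = sum(math.log10(q[frozenset(e)] / q_random)
                    for e in zip(path, path[1:]))
    return node_term + edge_term


def best_path(p, q, p_random, q_random, length, trials=200, seed=0):
    """Best path with `length` nodes found over random acyclic orientations."""
    rng = random.Random(seed)
    nodes = list(p)
    adj = {v: [] for v in nodes}
    for edge in q:
        a, b = tuple(edge)
        adj[a].append(b)
        adj[b].append(a)
    best_score, best_nodes = -math.inf, None
    for _ in range(trials):
        order = nodes[:]
        rng.shuffle(order)
        rank = {v: i for i, v in enumerate(order)}
        # table[v][n] = (score, path) of the best n-node path ending at v.
        table = {}
        for v in order:  # a topological order of the oriented graph
            node_term = math.log10(p[v] / p_random)
            table[v] = {1: (node_term, [v])}
            for u in adj[v]:
                if rank[u] > rank[v]:
                    continue  # edge points from v to u; handled when u is visited
                edge_term = math.log10(q[frozenset((u, v))] / q_random)
                for n, (score, path) in table[u].items():
                    candidate = score + edge_term + node_term
                    current = table[v].get(n + 1, (-math.inf,))[0]
                    if n < length and candidate > current:
                        table[v][n + 1] = (candidate, path + [v])
            if length in table[v] and table[v][length][0] > best_score:
                best_score, best_nodes = table[v][length]
    return best_score, best_nodes


# Invented probabilities for a four-node alignment graph.
p = {"a": 0.9, "b": 0.8, "c": 0.9, "d": 0.2}
q = {frozenset(("a", "b")): 0.9, frozenset(("b", "c")): 0.9,
     frozenset(("c", "d")): 0.3, frozenset(("a", "c")): 0.5}
score, nodes = best_path(p, q, p_random=0.5, q_random=0.5, length=3)
```

Because every path in a DAG is simple, the dynamic program never revisits a node, and repeating it over many random orientations makes it likely that any given high-scoring path is oriented consistently in at least one trial.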

2.2.2 Graemlin

Flannick et al. [70] introduced an algorithm named Graemlin. They used explicit models of functional evolution to quickly align many protein interaction networks. Flannick et al. permitted searches for arbitrary module structures by requiring the user to specify an edge-scoring matrix to encapsulate the desired module structure. Given a set of input networks, together with the phylogenetic relationship of the corresponding organisms, they began by performing a pairwise alignment of the two closest species (see Figures 2.3(A) and 2.3(B)). Flannick et al. then generated a set of clusters from each network, where each node (protein) and its d−1 closest neighbors constitute a d-cluster. They then scored all pairs of d-clusters by finding for each pair the highest-scoring mapping among nodes and selecting the pairs that scored greater than a user-specified threshold t. In short, the score for a pair of aligned nodes, $v_1$ and $v_2$, was computed as the log-odds ratio of the probabilities obtained from the alignment model $\mathrm{Pr}_M$ and the random model $\mathrm{Pr}_R$.

$$S_N(v_1, v_2) = \log \frac{\mathrm{Pr}_M(v_1, v_2)}{\mathrm{Pr}_R(v_1, v_2)} \qquad (2.2)$$

$\Pr_M$ and $\Pr_R$ were defined as probability distributions over BLAST alignments; each specified the probability of obtaining every possible BLAST bit-score. Flannick et al. then transformed all high-scoring pairs into seeds by aligning the two highest scoring nodes and adding all nodes connected to the node currently in the alignment (see Figure 2.3(C)). This step was performed by adding the node that maximally increased the alignment score. The extension phase continued until the addition of more nodes to the alignment did not increase the alignment score. Flannick et al. then transformed the resulting alignment, together with the unaligned nodes from the original networks, into three generalized networks for use in the next phase of progressive alignment (see Figure 2.3(D)). Finally, Flannick et al. performed three pairwise alignments, one for each of the newly created generalized networks (see Figure 2.3(E)). Flannick et al. applied Graemlin to multiple organisms, such as E. coli and C. crescentus, in order to identify CPIMs.
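To make the node score of Equation 2.2 concrete, the following minimal sketch computes a log-odds score from two probabilities. It is an illustration only and not Graemlin's implementation: the function name `node_score` is hypothetical, the two probabilities are assumed to have been looked up already from the alignment and random models, and the natural logarithm is an assumption because the text does not state the base.

```python
import math


def node_score(pr_model: float, pr_random: float) -> float:
    """Log-odds score for one pair of aligned nodes (cf. Equation 2.2).

    pr_model  -- probability of the observed BLAST bit-score under the
                 alignment model (assumed to be given)
    pr_random -- probability of the same bit-score under the random model
    """
    if pr_model <= 0.0 or pr_random <= 0.0:
        raise ValueError("both probabilities must be positive")
    # Natural logarithm is an assumption; the base is not stated in the text.
    return math.log(pr_model / pr_random)
```

A positive score means the bit-score is more likely under the alignment model than under the random model; a score of zero means the two models are indistinguishable for that pair.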

2.2.3 Match-and-Split

Finally, Narayanan and Karp [165] presented a method that compares the PINs of two species to detect functionally similar CPIMs without constructing an alignment graph. They called their algorithm match-and-split. Given a pair of input graphs G and H, their algorithm


first identified locally matching proteins between the two networks, and discarded the other proteins, which share no homology. They matched two proteins if (i) the BLAST E-value was at most $1 \times 10^{-7}$ and (ii) each protein was among the ten best BLAST hits of the other. Next, they split the input graphs into connected sets of proteins. They used combinatorial criteria to decide when the local neighborhoods of a pair of orthologs match. Under their model, they proved that a given pair of proteins can belong to at most one CPIM. This observation led to a top-down partitioning algorithm that found all maximal CPIMs in

Figure 2.3: Outline of Graemlin algorithm. (A) Shown here are four networks, together with their phylogenetic relationship. Graemlin will multiply align all four. (B) Graemlin first performs a pairwise alignment of the two closest species. (C) Graemlin adds to the alignment the pair of nodes or single node from the frontier (nodes not in the alignment graph) that maximally increases the alignment score; the extension phase stops when no nodes from the frontier can augment the alignment score. (D) Graemlin transforms the resulting alignment, together with the unaligned nodes from the original networks, into three generalized networks for use in the next phase of progressive alignment. (E) Graemlin will perform three pairwise alignments, one for each of the newly created generalized networks. Image and caption taken with permission from Flannick et al. [70].


polynomial time. These connected subgraphs were locally matching with respect to the full input networks. The match and split steps were repeated on each pair of subgraphs recursively. Narayanan and Karp used their method to identify CPIMs by doing pairwise comparisons of four PINs: Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, and Saccharomyces cerevisiae.
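The protein-matching criterion used in the match step can be sketched as follows. This is only an illustration of the two stated conditions, not the authors' code. The data layout (nested dictionaries of E-values), the function names, the choice to test the E-value of the hit from the first protein to the second, and the ranking of "best hits" by E-value are all assumptions.

```python
def top_hits(hits, k=10):
    """Return the k proteins with the smallest E-values in `hits`.

    `hits` maps a subject protein to the E-value of its BLAST hit.
    Ranking the "best" hits by E-value is an assumption.
    """
    ranked = sorted(hits.items(), key=lambda item: item[1])
    return {protein for protein, _ in ranked[:k]}


def locally_match(a, b, hits_g_to_h, hits_h_to_g, max_evalue=1e-7, k=10):
    """Check the two stated conditions for protein a (in G) and b (in H).

    hits_g_to_h[a][b] is the E-value of the hit from a to b, and
    hits_h_to_g[b][a] the E-value of the reverse hit (hypothetical layout).
    """
    evalue = hits_g_to_h.get(a, {}).get(b)
    # Condition (i): the BLAST E-value is at most the cutoff.
    if evalue is None or evalue > max_evalue:
        return False
    # Condition (ii): each protein is among the k best hits of the other.
    return (b in top_hits(hits_g_to_h[a], k)
            and a in top_hits(hits_h_to_g.get(b, {}), k))
```

For example, with `hits_g_to_h = {"g1": {"h1": 1e-20, "h2": 1e-3}}` and the mirrored reverse table, `g1` matches `h1` but not `h2`, because the second hit fails the E-value cutoff.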

Chapter 3

The Landscape of Human Proteins Interacting with Pathogens

PPI-mediated mechanisms of infection have been studied in detail for many pathogens [69, 79, 93, 95, 131, 161]. However, many questions are relatively unexplored. What are the properties of human proteins that interact with pathogens? Do pathogens interact with certain functional classes of human proteins? Which infection mechanisms and pathways are commonly triggered by multiple pathogens? A significant hurdle to such global cross-pathogen comparisons has been the shortage of large-scale datasets of interactions between host and pathogen proteins. High-throughput experimental screens have been primarily used to identify intra-species PPIs [77, 81, 99, 108, 109, 142, 194, 214, 232]. However, recent efforts to include host-pathogen PPIs in public databases have made it easier to acquire the data needed to address these important questions.

In this chapter, we integrate experimentally verified human-pathogen PPIs for 190 pathogen strains from seven public databases [80, 86, 96, 117, 159, 196, 246]. We partition the strains into 54 different pathogen groups, where each group is made up of taxonomically related strains. We analyze the intra-species network of PPIs between the 1,233 unique human proteins spanned by the host-pathogen PPIs, and find that pathogens, both viral and bacterial, tend to interact with hubs (proteins with many interacting partners) and bottlenecks (proteins that are central to many paths in the network) in the human PPI network.

We pay special attention to two networks of PPIs between human proteins: the proteins that interact with at least two viral pathogen groups (see Figure 3.1) and the proteins that interact with at least two bacterial pathogen groups (see Figure 3.2, noting that the figure also contains human proteins interacting with only one bacterial pathogen group). We compute the Gene Ontology (GO) [10] functions enriched in each of these two sets of human proteins. Such enriched functions highlight human pathways that may be involved in infection mechanisms that are common to multiple pathogens. Examples of such processes


and components include cell cycle regulation, I-κB kinase/NF-κB cascade, and the nuclear membrane. These functions shed light on a number of features shared by different pathogens: interacting with human transcription factors and key proteins that control the cell cycle; transport of genetic material through the nuclear membrane (in the case of viruses) to subvert the host’s transcriptional machinery; triggering an immune response via toll-like receptors; and activation of NF-κB signaling. We discuss in detail the importance of these and other enriched functions, as well as the proteins they annotate and the pathogens they

Figure 3.1: The network of interactions between human proteins interacting with at least two viral pathogen groups. The size and color of a protein denote the number of pathogen groups that interact with it: light blue is two, dark blue is three, green is four, yellow is five, orange is six, and red is seven. We used the Cerebral plugin [15] for Cytoscape [201] to render this image.


Figure 3.2: The network of interactions between human proteins interacting with at least one bacterial pathogen group. The size and color of a protein denote the number of pathogen groups that interact with it: purple is one, light blue is two, dark blue is three, and green is four. We used the Cerebral plugin [15] for Cytoscape [201] to render this image.

interact with. Overall, these results provide the first global view of aspects of human cellular processes that are controlled by and respond to pathogens.

Our results should be interpreted with caution since no single pathogen may interact with all the proteins and PPIs we analyze. In addition, data for bacterial pathogens are scarce. However, we suggest that piecing together pathogen-interacting human proteins across multiple pathogens has the potential to provide insights into common molecular mechanisms of infection and proliferation used by different pathogens.


3.1 Methods

3.1.1 Notation

We represent the set of known interactions between human proteins as an undirected graph $G(V, E)$, where $V$ is the set of nodes (proteins) and $E$ is the set of edges (interactions). Let $\mathcal{P}$ be the set of pathogen groups. We say that a pathogen group $P$ interacts with a human protein $s$ if $s$ interacts with a protein in $P$. For a pathogen group $P \in \mathcal{P}$, we define $V_P \subseteq V$ to be the set of human proteins that interact with $P$. Let $T = \bigcup_{P \in \mathcal{P}} V_P$ be the set of proteins that interact with at least one pathogen. Let $T_V$ (respectively, $T_B$) be the set of human proteins that interact with at least one viral (respectively, one bacterial) group. Let $T_V^{(k)} \subseteq T_V$ (respectively, $T_B^{(k)} \subseteq T_B$) be the set of human proteins that interact with at least $k$ viral (respectively, bacterial) pathogen groups; by definition, $T_V^{(1)} \equiv T_V$ and $T_B^{(1)} \equiv T_B$. Finally, let $V'$ be the set of human proteins not interacting with a single pathogen in our dataset. We now describe in detail the tests we use to analyze $T_B$, $T_V$, $T_B^{(2)}$, $T_V^{(2)}$, and 54 $V_P$ sets. (See Section 3.1.8 for an explanation of how we obtained 54 such sets.)
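The set definitions above translate directly into set operations. The sketch below is illustrative only: the protein and group names are made-up placeholders rather than data from this study, and the helper name `targeted_at_least` is hypothetical.

```python
from collections import Counter


def targeted_at_least(groups, k=1):
    """Proteins that interact with at least k of the given pathogen groups.

    `groups` maps a pathogen group name to its set V_P of human proteins.
    """
    counts = Counter(p for proteins in groups.values() for p in set(proteins))
    return {p for p, c in counts.items() if c >= k}


# Placeholder data: h1..h5 are human proteins, not real identifiers.
V = {"h1", "h2", "h3", "h4", "h5"}
viral = {"virusA": {"h1", "h2"}, "virusB": {"h2", "h3"}}
bacterial = {"bacteriumA": {"h4"}}

T_V = targeted_at_least(viral)         # at least one viral group
T_V2 = targeted_at_least(viral, k=2)   # at least two viral groups
T_B = targeted_at_least(bacterial)     # at least one bacterial group
T = T_V | T_B                          # at least one pathogen group
V_prime = V - T                        # no pathogen interaction in the data
```

With these placeholder inputs, only `h2` is targeted by two viral groups and only `h5` remains in `V_prime`.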

3.1.2 Analysis of degree in the human PPI network

The degree of a protein in a graph is the number of interactions in which it participates, not including self-interactions. We plot distributions of the degrees of four sets of proteins in $G$: (i) $V'$, the set of all proteins in $G$ not interacting with a single pathogen in our dataset; (ii) $T_B$, the set of all human proteins interacting with at least one bacterial pathogen group; (iii) $T_V$, the set of all human proteins interacting with at least one viral pathogen group; and (iv) $T_V^{(2)}$, the set of human proteins interacting with at least two viral pathogen groups. In this analysis, we ignore $T_B^{(2)}$ since it contains only 20 proteins. If the distributions of $T_B$ and $T_V$ are more biased towards high degree proteins than the distribution for $V'$, then we hypothesize that viral and bacterial pathogens have evolved to interact with hub proteins in the human PPI network.
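As a minimal sketch of this analysis, the code below computes degrees while skipping self-interactions and then builds a cumulative distribution in which the fraction at a value counts the proteins with that value or greater. The function names are hypothetical, and treating repeated edges between the same pair as a single interaction is an assumption.

```python
def degrees(nodes, edges):
    """Degree of every protein, not counting self-interactions.

    Repeated edges between the same pair are counted once (an assumption).
    """
    neighbors = {v: set() for v in nodes}
    for u, w in edges:
        if u != w:  # skip self-interactions
            neighbors[u].add(w)
            neighbors[w].add(u)
    return {v: len(partners) for v, partners in neighbors.items()}


def cumulative_fraction(values):
    """Map each observed value x to the fraction of entries that are >= x."""
    n = len(values)
    return {x: sum(1 for y in values if y >= x) / n
            for x in sorted(set(values))}
```

The distribution for a protein set is then obtained by restricting the degree dictionary to that set, for example `cumulative_fraction([deg[p] for p in protein_set])`, where `protein_set` is any of the four sets above.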

3.1.3 Analysis of betweenness centrality in the human PPI network

The degree of a protein captures only its local connectivity. Centrality captures both global and local features of a protein’s importance in a network. In this chapter, we use the notion of a protein’s betweenness centrality [74]. A protein with high betweenness centrality is characteristic of a bottleneck in an interaction network (i.e., there are many paths that pass through this protein) [243].


We define the betweenness centrality $bc(v)$ of a protein $v$ as the fraction of shortest paths in $G$ between all protein pairs $(u, w)$ that pass through the protein $v$. Given $u, v, w \in V$, let $\sigma_{uw}$ denote the number of shortest paths¹ between proteins $u$ and $w$ and let $\sigma_{uw}(v)$ denote the number of these that pass through $v$. Then the betweenness centrality of $v$ is

$$bc(v) = \sum_{\substack{u, w \in V \\ u, w \neq v}} \frac{\sigma_{uw}(v)}{\sigma_{uw}}$$

In our analysis, we divide $bc(v)$ by the number of pairs of nodes in $G$, yielding a quantity between 0 and 1. We use the algorithm devised by Brandes [32] to compute the betweenness centrality of all nodes in $G$. This algorithm runs in time proportional to the product of the number of nodes in $G$ and the number of edges in $G$. As with the degree analysis, we plot distributions of the betweenness centrality for $V'$, $T_B$, $T_V$, and $T_V^{(2)}$. If the distributions for $T_B$, $T_V$, and $T_V^{(2)}$ are biased toward higher values of centrality than the distribution for $V'$, we hypothesize that pathogens have evolved to interact with bottlenecks in the human PPI network.
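For readers who want to see the computation spelled out, here is a compact sketch of Brandes-style betweenness for an unweighted, undirected graph stored as an adjacency dictionary. It is not the code used in this study. Two choices are assumptions: the sum is taken over unordered pairs $(u, w)$, and the result is divided by $n(n-1)/2$, the number of unordered node pairs.

```python
from collections import deque


def betweenness(adj):
    """Normalized betweenness centrality; adj maps node -> set of neighbors."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in order of decreasing distance.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    pairs = n * (n - 1) / 2
    # Each unordered pair was visited from both endpoints, hence the halving.
    return {v: (b / 2.0) / pairs if pairs else 0.0 for v, b in bc.items()}
```

On the three-node path a, b, c, the middle node lies on the single shortest path between a and c, so its normalized value is one out of three pairs, and both end nodes score zero.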

3.1.4 Gene set enrichment analysis

Let $L$ be the ranked list of the proteins in $V$, where we rank the proteins either by degree or by betweenness centrality. Given $L$ and a predefined set $S$ of proteins of interest (e.g., those interacting with HIV), we use a Gene Set Enrichment Analysis (GSEA) to determine whether the proteins contained in $S$ are randomly distributed throughout $L$ or concentrated at the top. In the ranked list $L$, let $l_i$ be the value (of degree or centrality) at index $i$, $1 \le i \le |L|$. We abuse notation and say that an index $i$ is an element of $S$ if the protein whose rank is $i$ belongs to $S$. First, we compute $m = \sum_{i \in L} l_i$, the sum of all the values in $L$. Next, for each index $i$ in $L$, we compute two values:

$$P_{\mathrm{hit}}(S, i) = \sum_{j \in S,\ j \le i} \frac{l_j}{m}$$

$$P_{\mathrm{miss}}(S, i) = \sum_{j \notin S,\ j \le i} \frac{1}{|L| - |S|}$$

Thus, $P_{\mathrm{hit}}(S, i)$ measures the weighted fraction of proteins with index at most $i$ that are in $S$ and $P_{\mathrm{miss}}(S, i)$ measures the fraction of proteins with index at most $i$ that are not in $S$.

¹ There may be multiple equally long paths between u and w that are shorter than any other path between u and w.


We handle multiple ranks with identical values by computing these two values only at the largest rank for each unique value in $L$. Finally, we define the enrichment score as the largest positive value of $P_{\mathrm{hit}}(S, i) - P_{\mathrm{miss}}(S, i)$, i.e.,

$$es(S, L) = \max_{1 \le i \le |L|} \Bigl( \max\bigl( P_{\mathrm{hit}}(S, i) - P_{\mathrm{miss}}(S, i),\ 0 \bigr) \Bigr)$$

A large positive value of $es(S, L)$ indicates that the proteins in $S$ have high degree or high betweenness centrality. Note that our modification of the original definition of the enrichment score [219] ensures that if $S$ mainly contains proteins with low degree or betweenness centrality, then the score will be close to 0, since $P_{\mathrm{hit}}(S, i) - P_{\mathrm{miss}}(S, i)$ will be negative for most indices. We record the rank $i$ that yields $es(S, L)$. To compute p-values for an observed enrichment score $s$, we generate a null distribution of scores by repeatedly selecting $|S|$ random nodes in $L$ and computing the score for each random subset of nodes. We repeat this process 1,000,000 times and estimate the p-value for $s$ as the fraction of random sets whose score is at least as large as $s$. We obtain our results by testing each of 57 sets: $T_B$, $T_V$, $T_V^{(2)}$, and the sets $V_P$ corresponding to each of the 54 pathogen groups.
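The enrichment score and its permutation p-value can be sketched as below. The code follows the definitions in this section literally, including normalizing hits by $m$, the sum of all values in $L$, and evaluating the difference only at the last rank of each run of tied values. It is an illustration, not the original implementation: the function names, the input layout (values sorted in decreasing order with a parallel list of membership flags), and the small default number of trials are assumptions; the text uses 1,000,000 trials.

```python
import random


def enrichment_score(values, in_set):
    """es(S, L) for values sorted in decreasing order.

    in_set[i] is True when the protein at rank i belongs to S.
    """
    m = sum(values)
    n_miss = len(values) - sum(in_set)
    if m == 0:
        return 0.0
    hit = miss = best = 0.0
    for i, (x, member) in enumerate(zip(values, in_set)):
        if member:
            hit += x / m
        else:
            miss += 1.0 / n_miss
        # Evaluate only at the largest rank of each unique value.
        last_of_tie = i + 1 == len(values) or values[i + 1] != x
        if last_of_tie:
            best = max(best, hit - miss)
    return best


def permutation_p_value(values, in_set, trials=1000, seed=0):
    """Fraction of random sets of size |S| scoring at least the observed es."""
    rng = random.Random(seed)
    observed = enrichment_score(values, in_set)
    n, k = len(values), sum(in_set)
    at_least = 0
    for _ in range(trials):
        chosen = set(rng.sample(range(n), k))
        flags = [i in chosen for i in range(n)]
        if enrichment_score(values, flags) >= observed:
            at_least += 1
    return at_least / trials
```

A set concentrated at the top of the list yields a positive score, while a set concentrated at the bottom is clipped to zero, which matches the behavior described for the modified score.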

3.1.5 Correlation of GSEA degree and centrality

We perform three analyses to discount the possibility that a human protein’s degree and its centrality are correlated. First, we construct scatter plots of each protein’s degree and centrality. Second, Yu et al. [243] classify a protein as a hub or bottleneck by first sorting proteins according to degree (respectively, centrality) and then classifying proteins with the top 20% of degree values (respectively, centrality values) as hubs (respectively, bottlenecks). Using different cutoffs (10%, 20%, 30%, and 40%) for classifying a protein as a hub or as a bottleneck, we calculate the fraction of hub–bottleneck, non-hub–bottleneck, hub–non-bottleneck, and non-hub–non-bottleneck proteins. Finally, for each GSEA analysis, we determine the set of proteins that contribute to the enrichment score (ES; see Section 3.1.4). For each pathogen group and for the virus set, the bacteria set, and the multivirus set, we compute the Jaccard coefficient of the set of proteins that contribute to the degree ES and the set that contributes to the centrality ES.
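The second and third analyses can be illustrated with two small helpers. This is a sketch under stated assumptions: the function names are hypothetical, ties at the cutoff are broken arbitrarily by sort order, and the Jaccard coefficient of two empty sets is defined here as 0.

```python
def top_fraction(scores, fraction=0.2):
    """Proteins ranked in the top `fraction` by score (hubs or bottlenecks)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[: int(len(ranked) * fraction)])


def hub_bottleneck_fractions(degree, centrality, fraction=0.2):
    """Fractions of proteins in the four hub/bottleneck categories."""
    hubs = top_fraction(degree, fraction)
    bottlenecks = top_fraction(centrality, fraction)
    n = len(degree)
    counts = {"hub-bottleneck": 0, "hub-non-bottleneck": 0,
              "non-hub-bottleneck": 0, "non-hub-non-bottleneck": 0}
    for p in degree:
        key = ("hub" if p in hubs else "non-hub") + "-" + \
              ("bottleneck" if p in bottlenecks else "non-bottleneck")
        counts[key] += 1
    return {k: c / n for k, c in counts.items()}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two protein sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Calling `hub_bottleneck_fractions` once per cutoff (0.1, 0.2, 0.3, 0.4) gives the kind of breakdown described above, and `jaccard` compares the two sets of proteins contributing to the degree and centrality scores.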

3.1.6 Functional enrichment

We isolate functionally coherent subsets of human proteins among the sets $T_B$, $T_V$, $T_B^{(2)}$, $T_V^{(2)}$, and the sets $V_P$ corresponding to each of the 54 pathogen groups using a test for functional enrichment. Given the hierarchical structure of the Gene Ontology (GO) [10], we account for dependencies between annotations by using the method proposed by Grossman et al. [85]. Let $S$ be a set of proteins of interest (e.g., the set of proteins interacting with HIV). We aim to compute GO functions that annotate a surprisingly large number of proteins in $S$. To


this end, for each function $f$ in GO, we count $s_f$, the number of proteins in $S$ annotated with $f$, and $s_{pa(f)}$, the number of proteins in $S$ annotated by at least one parent of $f$. We also compute $v_f$ and $v_{pa(f)}$, the number of proteins in $V$ annotated by $f$ and by at least one parent of $f$, respectively. With these four counts in hand, we use the hypergeometric distribution to compute the probability $p_f(S, V)$ of drawing $s_f$ or more proteins from a set of $v_f$ marked proteins when we select $s_{pa(f)}$ proteins at random from a universe of $v_{pa(f)}$ proteins:

$$p_f(S, V) = \sum_{k = s_f}^{\min(s_{pa(f)},\, v_f)} \frac{\binom{v_f}{k} \binom{v_{pa(f)} - v_f}{s_{pa(f)} - k}}{\binom{v_{pa(f)}}{s_{pa(f)}}}$$

We account for multiple hypothesis testing using the method of Benjamini and Hochberg [23]. We consider only functions enriched with a p-value of at most 0.05. Note that different enriched functions may annotate identical sets of human proteins. In each such case, we group the functions and associate the most enriched function (and its p-value) with the group. To report enrichment ranks, we sort the groups in increasing order of p-value.
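The hypergeometric tail above and the multiple-testing correction can be sketched with the standard library alone. The function names are hypothetical and the code is an illustration of the formula, not the software used in this study. The correction shown is the usual step-up adjustment of Benjamini and Hochberg, returned as adjusted p-values in the input order.

```python
from math import comb


def parent_child_p_value(s_f, s_pa, v_f, v_pa):
    """P(drawing s_f or more marked proteins) for the four counts in the text.

    s_f, s_pa -- proteins in S annotated by f / by at least one parent of f
    v_f, v_pa -- proteins in V annotated by f / by at least one parent of f
    """
    upper = min(s_pa, v_f)
    tail = sum(comb(v_f, k) * comb(v_pa - v_f, s_pa - k)
               for k in range(s_f, upper + 1))
    return tail / comb(v_pa, s_pa)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the same order as the input."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

A function would then be reported as enriched when its adjusted p-value is at most 0.05.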

3.1.7 Biclustering of enriched functions

We compute enriched functions in each of the 54 sets of human proteins interacting with each pathogen group. We construct a binary matrix whose rows are enriched functions and whose columns are pathogen groups. An entry is one in this matrix if and only if the function is enriched with a corrected p-value of at most 0.05 in the pathogen group. In this binary matrix, we define a bicluster to be a subset $R$ of rows and a subset $C$ of columns such that each row-column pair in $R \times C$ contains a one. We also require a bicluster to be closed, i.e., each row not in $R$ (respectively, column not in $C$) contains a zero in at least one column in $C$ (respectively, row in $R$). We use the Bimax algorithm to compute all closed biclusters in this binary matrix [182].
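The definition of a closed bicluster can be checked directly, as in the sketch below. Note that this is only a verifier for a candidate pair of row and column subsets; it is not the Bimax algorithm, which enumerates all closed biclusters. The function name and the list-of-lists matrix layout are assumptions.

```python
def is_closed_bicluster(matrix, rows, cols):
    """Return True if (rows, cols) is a closed bicluster of a 0/1 matrix.

    matrix -- list of equal-length lists of 0/1 entries
    rows   -- indices of the row subset R (enriched functions)
    cols   -- indices of the column subset C (pathogen groups)
    """
    rows, cols = set(rows), set(cols)
    if not rows or not cols:
        return False
    # Every row-column pair in R x C must contain a one.
    if any(matrix[r][c] != 1 for r in rows for c in cols):
        return False
    # Closure: no row outside R may be all ones on C ...
    for r in range(len(matrix)):
        if r not in rows and all(matrix[r][c] == 1 for c in cols):
            return False
    # ... and no column outside C may be all ones on R.
    for c in range(len(matrix[0])):
        if c not in cols and all(matrix[r][c] == 1 for r in rows):
            return False
    return True
```

For the matrix `[[1, 1, 0], [1, 1, 0], [0, 1, 1]]`, rows {0, 1} with columns {0, 1} form a closed bicluster, whereas row {0} alone with the same columns does not, because row 1 could still be added.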

3.1.8 Datasets used

We downloaded all datasets used in this study in August 2007. We gathered 10,477 experimentally detected and manually curated protein-protein interactions (PPIs) between human and pathogen proteins and 75,457 experimentally verified PPIs between human proteins from primary literature [36] and seven databases: the Biomolecular Interaction Network Database [80], the Database of Interacting Proteins [196], the Human Protein Reference Database [159], IntAct [96], the Molecular INTeraction database [246], the Munich Information Center for Protein Sequences [86], and Reactome [117]. Table 3.1 contains statistics on the experimental methods that yielded these PPIs and the literature support for the


Table 3.1: Summary of experimental methods and literature support for the host-pathogen PPIs in our dataset. The experimental group “Not specified” denotes that there was no observation method listed in any database. Note that every interaction in this study must have at least one piece of literature support.

| Support | Method | #PPIs | Fraction of PPIs |
|---|---|---|---|
| Experimental | Reactome curated | 7,229 | 0.69 |
| Experimental | Not specified | 2,305 | 0.22 |
| Experimental | Yeast two-hybrid | 419 | 0.04 |
| Experimental | Pull down | 314 | 0.03 |
| Experimental | Coimmunoprecipitation | 210 | 0.02 |
| Experimental | Other technology (28 methods) | 210 | 0.02 |
| Experimental | Two or more experimental methods | 260 | 0.02 |
| Literature | Described in one paper | 9,810 | 0.94 |
| Literature | Described in two papers | 198 | 0.02 |
| Literature | Described in more than two papers | 469 | 0.04 |

PPIs. These interactions cover 190 different pathogen strains. Two pathogens, HIV and Hepatitis, account for 88.4% (9,268) of the human-pathogen PPIs. To mitigate this bias, we merge pathogen strains into 54 groups based on taxonomic similarity: each pathogen group contains pathogens belonging to the same genus or, in the case of viruses, the same family. We construct lists of unique human proteins interacting with each group. Table 3.2 summarizes the number of interactions acquired for each pathogen group. For some analyses, we consider a human PPI network assembled from unbiased high-throughput experiments [68, 194, 214] and a network constructed from only manually curated human PPIs [117, 159]. These networks contain 13,324 and 59,396 interactions, respectively. We obtained functional annotations from the Gene Ontology (GO) [10].

Table 3.2: Overview of human-pathogen PPIs. For each pathogen group we list the total number of PPIs involving pathogen proteins in that group, the number of strains in that group, the number of unique human proteins interacting with that group, and the number of these that interact with at least one other human protein.

| Group | # PPIs | # Strains | # Unique targeted human proteins | # Targeted proteins in human PPI network |
|---|---|---|---|---|
| HIV | 8,024 | 44 | 743 | 671 |
| Hepatitis | 1,244 | 16 | 109 | 93 |
| Influenza | 287 | 4 | 76 | 76 |
| Papillomavirus | 229 | 12 | 96 | 94 |
| Epstein-Barr virus | 211 | 2 | 135 | 121 |
| Adenovirus | 80 | 9 | 60 | 59 |
| Herpesvirus | 64 | 20 | 54 | 54 |
| Yersinia | 57 | 3 | 56 | 45 |
| Sarcoma virus | 52 | 6 | 36 | 35 |
| T-lymphotropic virus | 25 | 2 | 23 | 23 |
| E. coli | 22 | 1 | 20 | 20 |
| Chlamydia | 19 | 1 | 19 | 19 |
| Neisseria | 16 | 1 | 16 | 16 |
| Streptococcus | 14 | 5 | 8 | 8 |
| Vaccinia virus | 13 | 4 | 7 | 7 |
| Staphylococcus | 12 | 3 | 10 | 7 |
| Pseudomonas | 11 | 1 | 9 | 9 |
| Measles virus | 10 | 3 | 4 | 4 |
| Polyomavirus | 8 | 3 | 8 | 8 |
| Leukemia virus | 7 | 1 | 7 | 6 |
| Shigella | 5 | 1 | 4 | 4 |
| Anemia virus | 4 | 4 | 2 | 1 |
| Bacillus | 4 | 3 | 4 | 4 |
| Hantaan virus | 4 | 1 | 4 | 4 |
| SARS | 4 | 1 | 4 | 4 |
| Clostridium | 3 | 3 | 2 | 2 |
| Dengue virus | 3 | 3 | 3 | 3 |
| Rotavirus | 3 | 3 | 2 | 2 |
| Echovirus | 3 | 2 | 1 | 1 |
| Helicobacter | 3 | 2 | 2 | 2 |
| Salmonella | 3 | 1 | 3 | 3 |
| Seoul virus | 3 | 1 | 3 | 3 |
| Listeria | 3 | 1 | 2 | 2 |
| SIV | 2 | 2 | 2 | 2 |
| Orf virus | 2 | 2 | 1 | 1 |
| Foamy virus | 2 | 1 | 2 | 2 |
| Puumala virus | 2 | 1 | 2 | 2 |
| Stomatitis virus | 2 | 1 | 2 | 2 |
| Mycoplasma | 2 | 1 | 2 | 1 |
| Sendai virus | 1 | 1 | 1 | 1 |
| Nucleopolyhedrovirus | 1 | 1 | 1 | 1 |
| Rabies virus | 1 | 1 | 1 | 1 |
| Toxoplasma | 1 | 1 | 1 | 0 |
| Poliovirus | 1 | 1 | 1 | 1 |
| Nipah virus | 1 | 1 | 1 | 1 |
| Klebsiella | 1 | 1 | 1 | 1 |
| Enterobacteria | 1 | 1 | 1 | 1 |
| Mokola virus | 1 | 1 | 1 | 1 |
| West Nile virus | 1 | 1 | 1 | 1 |
| Tula virus | 1 | 1 | 1 | 1 |
| Corynephage | 1 | 1 | 1 | 1 |
| Ebola virus | 1 | 1 | 1 | 1 |
| Campylobacter | 1 | 1 | 1 | 1 |
| Plasmodium | 1 | 1 | 1 | 1 |
| TOTAL | 10,477 | 190 | 1,233 | 1,109 |

3.2 Results and Discussion

The human-pathogen PPIs involve 1,233 unique human proteins, of which 1,109 are known to interact with at least one other human protein. Of these 1,233 human proteins, 221 interact with at least two pathogen groups (182 with more than one viral pathogen and 20 with more than one bacterial pathogen).

3.2.1 Pathogens interact with human hubs and bottlenecks

As previously mentioned (see Chapter 2), researchers have argued that the degree distribution of PPI networks is scale-free and follows a power law and that such networks are robust in the face of attacks on random nodes [2, 140]. However, the selective removal of even a small number of nodes of high degree can dramatically change the topology of the network [2, 140].

There is considerable debate on the origins of the scale-free property and whether this property is an artifact of experimental biases and errors [91, 188, 218]. Notwithstanding this debate, we reason that pathogens may have evolved to interact with human proteins that are hubs (those involved in many interactions) or bottlenecks (those central to many pathways) [243] to disrupt key proteins in complexes and pathways. Our results support this hypothesis. Figure 3.3(a) displays the cumulative log-log plot of the degree distribution of four sets of proteins in the human PPI network: (i) all proteins not interacting with at least one pathogen in our dataset, (ii) “Viral set,” the subset of proteins interacting with at least one viral pathogen group, (iii) “Bacterial set,” the subset of proteins interacting with at least one bacterial pathogen group, and (iv) “Multiviral set,” the subset of proteins interacting with at least two viral pathogen groups.² These plots show that across almost

² We did not include the “Multibacterial” set of human proteins interacting with two or more bacterial pathogen groups in this analysis, since there are only 20 such proteins.

[Figure 3.3 plots. Panel (a): Cumulative degree distribution; x-axis: Degree of Human Protein in Human PPI Network; y-axis: Fraction of Proteins. Panel (b): Cumulative betweenness centrality distribution; x-axis: Centrality of Human Protein in Human PPI Network; y-axis: Fraction of Proteins. Legend for both panels: Non-pathogen interactors (9,110), Viral set (1,029), Bacterial set (108), Multiviral set (182).]

Figure 3.3: Cumulative log-log distributions of (a) node degrees and (b) centralities for four subsets of nodes in the human PPI network: (i) red pluses are the set of all proteins in the network not interacting with a single pathogen in our dataset; (ii) green squares correspond to the viral set; (iii) blue crosses are for the bacterial set, and (iv) magenta squares are for the multiviral set. Numbers in parentheses represent the number of proteins in each set. The fraction of proteins at a particular value of degree or centrality is the number of proteins having that value or greater divided by the number of proteins in the set.


the entire range of degrees, proteins interacting with viral and bacterial pathogen groups tend to have higher degrees than human proteins not interacting with pathogens. Further, proteins interacting with at least two viral pathogens have higher degrees than proteins interacting with one or more viral pathogens. The betweenness centrality results display the same trend (see Figure 3.3(b)). Across the entire range of values, proteins interacting with viral and bacterial pathogens have higher betweenness centrality. These results suggest that pathogens may have evolved to interact with human hub and bottleneck proteins, perhaps because these proteins control critical processes in the host cell [19, 49, 90, 94, 114, 243].

We use Gene Set Enrichment Analysis (GSEA) [219] to test whether the gaps we observed in Figure 3.3 are statistically significant. GSEA is a method developed to assess the significance of the differential expression of a pre-defined gene set in two phenotypes of interest [219]. GSEA ranks all genes by a suitable measure of differential expression (e.g., the t-statistic) and uses a modified Kolmogorov-Smirnov test to assess if the genes in the given set have surprisingly high or low ranks. Since distributions of the t-statistics of differentially expressed genes have been observed to follow a power-law distribution [101], we reason that GSEA may be appropriate to test whether the human proteins interacting with pathogens have surprisingly high degree or betweenness centrality.

Our GSEA results support the conclusions we draw from Figure 3.3 that pathogens preferentially interact with human protein hubs and bottlenecks: for each of the three sets of proteins plotted in Figure 3.3, GSEA yields a p-value of at most $3.5 \times 10^{-5}$ (degree) and $2.3 \times 10^{-4}$ (centrality). To alleviate the concern that the observed patterns may be artifacts of experimental biases or errors in the human PPI network, we repeat each of the analyses using two subsets of the human PPI network: a network composed of 13,324 PPIs detected only by high-throughput studies [68, 194, 214] and a network with 59,396 PPIs constructed using only manually curated interactions [117, 159]. The top half of Table 3.3 summarizes these results. For all three networks, the viral set, the bacterial set, and the multiviral set are significant at the 0.05 level for both degree and centrality, with the exception of the centrality of the multiviral set in the high-throughput network (p-value 0.10). Since 77.9% of the human-pathogen PPIs are for the human-HIV system, we repeat these analyses for each network after removing all human-HIV PPIs and obtain similar results (see the bottom half of Table 3.3).

Next, we aim to determine whether the results we observe for degree and centrality are correlated. Figures A.1, A.2, and A.3 in the appendix show that there is significant de-correlation between these two quantities in each of the three networks. In particular, there are many hub-non-bottleneck (top-left portion of each plot) and non-hub-bottleneck proteins (bottom-right portion of each plot). Similar non-hub-bottleneck and hub-non-bottleneck proteins have been found in networks of other model organisms [243].

Using different cutoffs of classification (10%, 20%, 30%, and 40%) of a protein as a hub or as a bottleneck, we calculate the fraction of hub-bottleneck, non-hub-bottleneck, hub-non-bottleneck, and non-hub-non-bottleneck proteins contained in each of the three networks. Table B.1 of the appendix shows that in each network, there are a non-negligible number of


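Table 3.3: Summary of GSEA results with and without human-HIV interactions for three networks: Whole human PPI network (W), the human PPI network yielded by High-Throughput experiments (HT), and the human PPI network consisting only of Manually Curated PPIs (MC). We report p-values only for the sets of human proteins in Figure 3.3. The “#proteins in group” column displays the total number of human proteins in that group. The “ES” column displays the enrichment score calculated by GSEA.

All Human-Pathogen PPIs

| Network | Protein Set | #proteins in group | Degree ES | Degree p-value | Centrality ES | Centrality p-value |
|---|---|---|---|---|---|---|
| W | Virus | 1,029 | 0.79 | $< 1 \times 10^{-6}$ | 0.83 | $< 1 \times 10^{-6}$ |
| W | Multivirus | 182 | 0.84 | $< 1 \times 10^{-6}$ | 0.86 | $1.2 \times 10^{-4}$ |
| W | Bacteria | 108 | 0.76 | $3 \times 10^{-5}$ | 0.89 | $2.3 \times 10^{-4}$ |
| HT | Virus | 466 | 0.68 | $< 1 \times 10^{-6}$ | 0.82 | $1.5 \times 10^{-3}$ |
| HT | Multivirus | 98 | 0.65 | 0.03 | 0.82 | 0.10 |
| HT | Bacteria | 43 | 0.79 | $2 \times 10^{-3}$ | 0.89 | 0.02 |
| MC | Virus | 958 | 0.78 | $< 1 \times 10^{-6}$ | 0.80 | $< 1 \times 10^{-6}$ |
| MC | Multivirus | 174 | 0.83 | $< 1 \times 10^{-6}$ | 0.82 | $1.9 \times 10^{-5}$ |
| MC | Bacteria | 100 | 0.73 | $3.4 \times 10^{-4}$ | 0.85 | $9.2 \times 10^{-4}$ |

Without Human-HIV PPIs

| Network | Protein Set | #proteins in group | Degree ES | Degree p-value | Centrality ES | Centrality p-value |
|---|---|---|---|---|---|---|
| W | Virus | 499 | 0.80 | $< 1 \times 10^{-6}$ | 0.85 | $< 1 \times 10^{-6}$ |
| W | Multivirus | 81 | 0.83 | $< 1 \times 10^{-6}$ | 0.88 | $3.6 \times 10^{-3}$ |
| HT | Virus | 267 | 0.70 | $1 \times 10^{-6}$ | 0.84 | $6.12 \times 10^{-4}$ |
| HT | Multivirus | 46 | 0.72 | 0.02 | 0.86 | 0.07 |
| MC | Virus | 958 | 0.78 | $< 1 \times 10^{-6}$ | 0.80 | $< 1 \times 10^{-6}$ |
| MC | Multivirus | 174 | 0.83 | $< 1 \times 10^{-6}$ | 0.85 | $1.3 \times 10^{-3}$ |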

non-hub-bottleneck proteins.

Tables B.2 and B.3 of the appendix are a copy of Table 3.3, except for additional columns listing the number of proteins contributing to the ES and the Jaccard coefficient of these two sets (the last column). As the last column indicates, the Jaccard coefficient ranges from 0.45 to 0.7, which suggests that while there is some overlap between the sets of human proteins contributing to the two enrichment scores, there are also proteins that contribute only to the degree ES or only to the centrality ES.

Finally, Tables B.4 and B.5 in the appendix present the GSEA results for all pathogens with at least seven human protein interactors that are enriched for both GSEA analyses. We choose seven as a cutoff, since the remaining pathogens interact with three or fewer human proteins.


3.2.2 Functions enriched in proteins interacting with pathogens

We compute over-represented GO terms in 58 sets of human proteins: the bacterial set, the viral set, the multibacterial set, the multiviral set, and the 54 sets of human proteins interacting with each of the 54 pathogen groups. Overall, we found 404 unique GO terms enriched in these sets.

We identify at least one enriched function in 21 pathogen groups. Analysis of these data identified 91 biclusters, each containing between two and seven pathogen groups and between two and 40 enriched GO functions. We focus on two of the biclusters below. The biclusters demonstrate that our analysis can group different enriched functions together even if the effects of the interactions on the host cell or the participating host proteins are different.

Our first example is a bicluster spanning the three pathogen groups Adenovirus, HIV, and Papillomavirus and 23 GO functions. GO biological processes in the bicluster include “cell cycle process” and “regulation of cellular process”. GO cellular components in the bicluster include “membrane-enclosed lumen” and “pore complex”. The membrane-enclosed lumen is the space within a sealed membrane or between two sealed membranes. Proteins annotated with these functions include KPNA2, a karyopherin, the histone deacetylases HDAC1 and HDAC2, and a number of Transcription Factors (TFs). KPNA2 plays an important role in both the import and export of material through the nuclear membrane. Interactions with KPNA2 enable a virus to enter the nucleus and take over the host’s transcriptional machinery [9, 54, 75, 183]. HDACs play an important role in silencing gene expression by removing acetyl groups from histones, thus causing them to wrap more tightly around DNA and blocking the binding of TFs. The role played by pathogen-HDAC interactions varies among pathogen groups. In the case of Adenovirus, it has been suggested that the pathogen protein E1B interacts with HDAC1/SIN3 to produce an enzymatically active complex that may be capable of repressing the transcriptional activity of the human TP53 protein in order to block apoptosis [184]. In contrast, the E7 Papillomavirus protein binds to the HDAC complex to promote cell growth, eventually leading to cervical cancer [34].

The second example is a bicluster containing a virus (HIV) and three bacteria (Chlamydia, Neisseria, and E. coli). This bicluster contains eleven GO functions including the biological processes “immune response”, “response to stimulus”, and “cytokine production”. Although these four groups of pathogens interact with proteins belonging to the same pathways, the functions of the interactions are different. In the case of the bacteria, these functions annotate such proteins as toll-like receptors (TLRs) and interleukin receptor-associated kinases (IRAKs), which are special classes of host proteins responsible for recognizing foreign material and activating an immune response. There are no reported interactions with these proteins and HIV, although some researchers suggest that the single-stranded RNA of HIV-1 may encode many TLR7/TLR8 ligands [157]. In contrast to the bacteria in the bicluster, HIV uses host proteins involved in immune response such as CD4, CCR5, and CXCR4 to gain entrance to the cell. HIV attaches to the host protein CD4, a T-cell glycoprotein, and subsequently to host chemokine receptors CCR5 and CXCR4. These binding events cause

28

conformational changes to host proteins that allow the membrane of the virus to fuse to thehost cell membrane [102].

3.2.3 The network of proteins interacting with multiple pathogens

The biclustering analysis of the previous section suggests that specific sets of pathogen groups might trigger or interact with the same human pathways and processes. Encouraged by these data, we ask if there are infection pathways commonly targeted or triggered by at least two viral or bacterial pathogen groups. To answer this question, we construct two networks of human proteins: one where every protein interacts with at least two viral pathogen groups and the other where every protein interacts with at least two bacterial pathogen groups. In each network, we include every PPI connecting two proteins in the network. Figures 3.1 and 3.2 display these networks. (Note that Figure 3.2 also contains human proteins that interact with only one bacterial pathogen group.) We compute the enriched GO functions in these two networks. We group and highlight some of the enriched functions and relevant sub-networks below. Throughout our discussion, we will refer to the localization of proteins in the four main regions of Figures 3.1 and 3.2: extracellular, the cell membrane, the cytoplasm, and the nucleus. For every GO function that we discuss, we mention its p-value and rank in the sorted list of all functions enriched in the corresponding network.
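The network construction described above can be sketched as follows. This is a minimal illustration, not the code used in this study: the function name, the input layout (pairs of human protein and pathogen group, and pairs of human proteins), and the toy data are all assumptions made for the example.

```python
from collections import defaultdict


def multi_pathogen_network(host_pathogen_ppis, human_ppis, min_groups=2):
    """Build the network of human proteins targeted by several pathogen groups.

    host_pathogen_ppis: iterable of (human_protein, pathogen_group) pairs.
    human_ppis: iterable of (human_protein, human_protein) pairs.
    Returns (nodes, edges); each edge is a frozenset of two node names.
    (Hypothetical helper; names and layout are illustrative assumptions.)
    """
    groups_by_protein = defaultdict(set)
    for human_protein, pathogen_group in host_pathogen_ppis:
        groups_by_protein[human_protein].add(pathogen_group)

    # Keep human proteins that interact with at least `min_groups` groups.
    nodes = {
        protein
        for protein, groups in groups_by_protein.items()
        if len(groups) >= min_groups
    }

    # Include every human PPI whose two endpoints are both in the network.
    edges = {
        frozenset((a, b))
        for a, b in human_ppis
        if a != b and a in nodes and b in nodes
    }
    return nodes, edges


# Toy input for illustration only.
toy_host_pathogen = [
    ("RB1", "HIV"), ("RB1", "Adenovirus"),
    ("E2F1", "HIV"), ("E2F1", "Papillomavirus"),
    ("CDK6", "Herpesvirus"),
]
toy_human_ppis = [("RB1", "E2F1"), ("RB1", "CDK6")]

nodes, edges = multi_pathogen_network(toy_host_pathogen, toy_human_ppis)
```

In this toy input, CDK6 is dropped because it is paired with a single pathogen group, so the RB1-CDK6 edge is dropped with it.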

Human proteins targeted by multiple viral pathogens

Our analysis highlights a number of important mechanisms that viral pathogens use to manipulate the human cell: i) control the host cell cycle program to ensure the transcription of viral genetic material; ii) utilize human TFs to promote the transcription of viral genetic material; iii) target key human proteins that regulate critical cellular processes such as apoptosis; and iv) subvert host machinery for transporting material across the nuclear membrane.

Control the host cell cycle program. Many viral pathogens are known to manipulate host cell cycle processes [17, 207, 209]. Our enrichment results reflect these findings. Our analysis identifies a sub-network of human proteins interacting with multiple viral pathogen groups enriched in the biological process “cell cycle” (p-value 6.2 × 10^-6, rank 21/89). Figure 3.4 displays this network. In this figure, we use GO annotations to clarify in which phase of the cell cycle each protein participates. The proteins in this figure are scattered through the cytoplasm and nucleus regions of Figure 3.1.
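To make the reported p-values and ranks concrete, the sketch below shows one common way to score the enrichment of a GO function in a protein set and then rank the functions. It assumes a one-sided hypergeometric test, which this section does not state; the function names and all inputs are hypothetical.

```python
from math import comb


def enrichment_p_value(population_size, annotated_in_population,
                       network_size, annotated_in_network):
    """P(X >= annotated_in_network) under a hypergeometric null.

    Assumption: a one-sided hypergeometric test. The test actually used
    to produce the p-values quoted in the text may differ.
    """
    upper = min(annotated_in_population, network_size)
    favourable = sum(
        comb(annotated_in_population, i)
        * comb(population_size - annotated_in_population, network_size - i)
        for i in range(annotated_in_network, upper + 1)
    )
    # Divide exact integers once to limit floating-point error.
    return favourable / comb(population_size, network_size)


def rank_functions(p_values):
    """Sort GO functions by p-value; return (function, p_value, rank) tuples."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    return [(name, p, rank) for rank, (name, p) in enumerate(ordered, start=1)]
```

A rank such as 21/89 then reads as position 21 in the list returned by `rank_functions` when 89 functions are enriched.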

Two stages of the cell cycle are enriched in our analysis: “G1 phase” (p-value 0.004, rank 52/89) and “Interphase” (p-value 0.01, rank 60/89). G1 is the initial stage of the cell cycle. In this phase, a number of proteins needed for DNA replication are transcribed and translated.

Figure 3.4: Enriched network of human proteins annotated with “cell cycle”. Proteins are grouped by phase (G1 phase, S phase, G2 phase, and mitosis), with additional groups labeled “Regulation of cell cycle” and “Non-specific”. The subset of proteins labeled as “Non-specific” are those not annotated with any function more specific than “cell cycle” in GO. If a protein participates in multiple phases, then it appears in each phase. An edge connecting two proteins denotes a known interaction in the human PPI network. Human proteins highlighted in red are those known to be involved in the induction of apoptosis.

A direct link between pathogen interference and the G1 phase has been established for HIV [130]. The HIV TAT protein elongates the G1 phase in order to promote viral gene expression. Of the thirteen human proteins in Figure 3.4 that participate in G1, ten are known to interact with TAT. One of these interactions is with the human protein RB1, a retinoblastoma-associated protein and a known tumor suppressor, which can repress genes transcribed by the E2F family of transcription factors that are required for entering the S phase of the cell cycle [92]. RB1 interacts with five pathogens in total: Adenovirus, Herpesvirus, HIV, Papillomavirus, and Simian virus [71, 103, 145, 147, 181]. In the case of HIV, the TAT protein interacts with the human RB1 protein to manipulate normal cell-cycle conditions and promote viral gene expression. The HIV long terminal repeat (LTR) is responsible for integrating viral DNA into the host genome and also acts as a promoter and enhancer of viral proteins. The LTR is most active in the early G1 phase, and its activity diminishes as the cell progresses through the G1 phase and enters the S phase [130]. Therefore, the extension of the G1 phase may increase activity of the LTR and the eventual production of more viral proteins. The Papillomavirus VE6 protein has been shown to manipulate the cell cycle by altering mitotic checkpoint fidelity through its effect on CDC2 activity and inactivation of TP53 [229]; it interacts with ten human proteins in Figure 3.4.

The human DLG1 protein is a “discs large homolog” that is essential for the transition from the G1 to S phase of the cell cycle. This protein interacts with three pathogens: Adenovirus, Papillomavirus, and T-lymphotrophic virus [137, 228]. The direct interaction of Papillomavirus proteins with human DLG1 has been implicated in development of HPV-related cancer [126].

Our analysis also identifies a network of human proteins enriched with the GO function “transcription regulator activity” (p-value 3.22 × 10^-7, rank 15/89). The portion of Figure 3.4 corresponding to the G1 phase includes the transcription factors E2F1, E2F4, and TAF1. Each of these proteins plays a key role in normal cell cycle progression from G1 to S phase. E2F1 and E2F4 interact with two pathogens, HIV and Papillomavirus [7, 106, 130]. TAF1 interacts with three pathogens, Adenovirus, HIV, and Papillomavirus [39, 78, 122]. By blocking the interaction of RB1 and various transcription factors, viral pathogens are able to prevent the cell from advancing into the S phase. This event extends the G1 phase of the cell cycle and allows the transcription of viral genetic material.

Regulate apoptosis. An important step in viral pathogenesis is the regulation of host cell apoptosis. During the initial process of infection, prevention of apoptosis is important to allow the replication of viral genetic material. However, promotion of apoptosis has been implicated in the progression of infection. Our results underscore both phenomena. Several host proteins involved in the control of cellular apoptosis are targeted by viral pathogens (human proteins highlighted in red in Figure 3.4). One of the key regulators of apoptosis, and perhaps the most studied human protein, is TP53. TP53 interacts with seven viral pathogens: Adenovirus, Hepatitis, HIV, Papillomavirus, Polyomavirus, Sarcoma virus, and Simian virus [41, 117, 136, 148, 150, 180, 189, 212]. Interactions with Adenovirus, Hepatitis, and Papillomavirus are responsible for preventing apoptosis of the infected human cell. Adenovirus E1B and E4 proteins bind with and inactivate TP53 [58, 185]. The human Survivin protein is an apoptosis inhibitor that is repressed by TP53 [100]. The repression of Survivin is necessary for the human cell to activate apoptotic programming. Another study shows that the HIV VPR protein can directly up-regulate the human Survivin protein [252]. These studies suggest a common mechanism for viral inhibition of apoptosis of the host cell. TP53 interacts with a number of Hepatitis proteins including the Core protein; Core has been shown to augment TP53’s transcriptional activity during infection to promote production of viral proteins, deregulate cell cycle checkpoint controls, and block TP53-mediated apoptosis [170, 237]. Papillomavirus VE6 interacts with human TP53 to promote degradation of TP53 and prevent apoptotic programming of the infected cell [210]. In contrast to these phenomena, the HIV protein TAT has been shown to assist in the progression of HIV infection by attaching to uninfected host T cells and triggering cell death via apoptosis [37, 139].

Transport viral material across the nuclear membrane. Since viruses lack the machinery needed to replicate their genomes, viral genetic material must first cross the barrier from the cytoplasm into the nucleus in order to make use of the host’s transcriptional machinery. Our analysis identifies a subset of human proteins enriched in four GO functions related to this important step: “nuclear transport” (p-value 2.32 × 10^-5, rank 24/89), “nuclear membrane part” (p-value 5.61 × 10^-5, rank 28/89), “protein import” (p-value 0.049, rank 41/89), and “nuclear pore” (p-value 0.018, rank 69/89). Figure 3.5 displays this network. The layout in Figure 3.1 displays these proteins both in the region labeled “cytoplasm” and in the region labeled “nucleus.”

The nuclear pore is a large protein complex that spans the nuclear membrane and allows for the transport of molecules, including proteins and RNA, across the nuclear envelope. There are ten human proteins that are part of the nuclear pore and targeted by multiple pathogens. These are the nodes containing a red section in Figure 3.5. Although smaller molecules may freely pass through the nuclear pores of the nuclear envelope, larger macromolecules require the assistance of karyopherins. Karyopherins may act as importins or exportins. Karyopherins bind to their cargo; after they cross the nuclear envelope, an interaction with the human RAN protein releases the bound partner. Figure 3.5 contains five human karyopherin proteins (KPNA1, KPNA2, KPNB1, RANBP5, TNPO1) as well as the human RAN protein, which interacts with five pathogens: Adenovirus, HIV, Influenza, Papillomavirus, and Sarcoma virus [56, 117]. The human protein KPNB1 interacts with four pathogens: HIV, Papillomavirus, Influenza, and Simian virus [54, 117, 164, 231]. In the case of HIV, one of the interacting partners of the human KPNB1 protein is REV. KPNB1 binds and mediates the nuclear import of the HIV REV protein. Once inside the nucleus, REV binds to unspliced viral mRNA and exports it from the nucleus to be translated [95]. REV is able to move between the nucleus and cytoplasm because it contains both a nuclear localization signal and a nuclear export signal. The human RANBP5 protein interacts with three pathogens: HIV, Hepatitis, and Papillomavirus [47, 66, 166]. The Hepatitis interactor for RANBP5 is the viral 5A protein. While little is known about the RANBP5 protein, studies suggest that the viral 5A protein may interact with RANBP5 and block secretion of cytokines produced in response to a viral infection [47]. This network highlights the ability of viral pathogens to make use of host machinery in order to translate their own genetic material and at the same time prevent the activation of a host immune response.

Figure 3.5: Enriched network of human proteins annotated with “nuclear transport” (blue), “nuclear membrane part” (green), “protein import” (orange), and “nuclear pore” (red). An edge connecting two proteins denotes a known interaction in the human PPI network.

Human proteins targeted by multiple bacterial pathogens

Although the number of human-bacteria PPIs gathered in this study is small (only 174), our methods identify an important subset of human proteins enriched for functions involved in immune response and interacting with multiple bacterial pathogen groups. Figure 3.6 displays a subset of the multibacterial set that is enriched in four GO functions: “immune system process” (p-value 1.397 × 10^-9, rank 1/28), “response to wounding” (p-value 3.93 × 10^-4, rank 8/28), “immune response” (p-value 0.002, rank 14/28), and “I-kappaB kinase/NF-κB cascade” (p-value 0.012, rank 18/28). The proteins contained in this image are located in the top-right corner of Figure 3.2.

Figure 3.6: Enriched network of human proteins annotated with “immune system process” (red), “response to wounding” (orange), “immune response” (green), and “I-kappaB kinase/NF-κB cascade” (blue). The proteins in the black box form a dense network of PPIs; we have left these edges out for clarity. An edge connecting two proteins denotes a known interaction in the human PPI network.

These functions are tied together by the Toll-Like Receptors (TLRs) and the protein IRAK1 found in the network in Figure 3.6. TLRs are a special class of cell-surface proteins that play a role in recognizing the presence of a pathogen and activating an immune response against the pathogen. The TLR/IRAK complex stimulates the activity of NF-κB [127, 156, 241], a complex of proteins that act as a TF for activating the production of a set of proteins in response to stimuli such as stress, cytokines, and bacterial or viral antigens.

The human TLRs and IRAK1 protein interact with the pathogen proteins FLIC (E. coli), HSP60 (Chlamydia), and PIB (Neisseria) [117]. FLIC is a flagellin protein. TLR4 and TLR5 contain a specific innate immune receptor for recognizing bacterial flagella [93, 160]. HSP60 is a heat shock protein that stimulates an immune response via TLR2 and TLR4 [52]. PIB is an outer membrane protein that is known to be recognized by TLR2, TLR4, and TLR9 [161].

Another human protein included in this network is HLA-DRA, which is part of the major histocompatibility complex (MHC). The MHC plays an important role in the immune system. HLA-DRA belongs to the class II MHC; proteins in this class belong to the lysosomal compartment of the cell, which contains digestive enzymes that kill engulfed foreign particles such as viruses or bacteria. The two bacterial partners for HLA-DRA are Mycoplasma and Staphylococcus [177, 250]. In the case of Mycoplasma, the interacting partner is the MAM superantigen, which is known to contribute to autoimmune disease by activating proinflammatory monokines such as interleukin 1β and the tumor necrosis factor alpha [1].

Other highly targeted human proteins

The networks in Figures 3.1 and 3.2 contain a number of other human proteins interacting with more than two pathogen groups. We discuss two of these proteins: STAT1 and EP300.

Viral pathogens also interact with other human proteins involved in immune response pathways that are not included in the network in Figure 3.6. An example is the human protein STAT1. When the cell recognizes the presence of foreign material, it activates an immune response as a defense mechanism to either remove the foreign material or cause the cell to undergo apoptosis. During this process, STAT1 is tyrosine- and serine-phosphorylated and forms a homodimer known as IFN-γ-activated factor (GAF). GAF migrates to the nucleus where it binds to specific cis-elements to drive the cell to produce interferons, agents that inhibit viral replication within other cells of the body [61]. STAT1 interacts with Adenovirus, HIV, and Hepatitis [110, 144, 149]. Hepatitis POLG is part of the pathogen core complex that allows the virus to avert the host anti-viral response by binding to host STAT1 and inhibiting its activity [27].

Within the nucleus, we see pathogens interact with the human protein EP300, a histone acetyltransferase that regulates transcription via chromatin remodeling. EP300 interacts with Adenovirus, HIV, Papillomavirus, and Polyomavirus [44, 59, 162, 173]. The pathogen Adenovirus targets human EP300 via E1A. E1A is an oncoprotein that stimulates cell growth and inhibits differentiation by binding to the EP300/CBP complex and deregulating cellular transcription programs [133]. Papillomavirus protein VE7 shares many functional and structural similarities with E1A and is an interacting partner of human EP300. The disruption of normal growth conditions brought about by the E1A-EP300 interaction leads to the development of cervical cancer [24]. In the case of HIV, the viral TAT protein targets human EP300. The resulting complex regulates TAT transactivating activity and may assist in the integration of viral genetic material into human DNA [235].

3.3 Conclusions

We have provided a general overview of the landscape of human proteins interacting with pathogens, and demonstrated that pathogens preferentially interact with two classes of human proteins: hubs (i.e., proteins that interact with many other human proteins) and bottlenecks (i.e., proteins that lie on many shortest paths) in the human PPI network. We identified GO functions over-represented in human proteins interacting with pathogens. Biclustering analysis demonstrated that many sets of pathogen groups target the same processes in the human cell, even if they interact with different proteins.

We constructed networks of PPIs between human proteins that interact with at least two viral pathogen groups and with at least two bacterial pathogen groups. Consideration of the GO functions enriched in these networks provided insights into numerous pathways targeted or triggered by multiple pathogens: control and deregulation of the cell cycle; import of pathogen proteins into the nucleus in an attempt to subvert the host’s DNA replication and transcription machinery; manipulation of host cellular programs such as apoptosis; and immune response and activation of NF-κB pathways via the TLR/IRAK complex.

A striking aspect of this network is that human proteins that mediate pathogen effects are often proteins in cancer pathways (e.g., RB1, TP53, and STAT1). We note that only some of the pathogens targeting such proteins are known to cause cancer themselves (e.g., Herpesvirus and Papillomavirus). In fact, a number of parallels are becoming evident between infection and cancer, for instance, in the part that TLRs play in angiogenesis and their potential as targets for therapeutics [51, 155] and the role that viruses may play in the development of inflammatory diseases and cancer [208]. Cell cycle regulators and many TFs have been extensively studied in the context of mediating tumor formation. Our observation that they are also communication vehicles for pathogens suggests that the link between pathogen infection and cancer may be worthy of further experimental studies.

We reiterate that our results should be interpreted with caution since no single pathogen may target all the proteins we analyze. As interactions between host and pathogen molecules are discovered on genome-wide scales [36], computational analyses such as those presented in this chapter may provide a more detailed understanding of the landscape of host pathways and processes that pathogens target.

Chapter 4

The Human-Pathogen PPI Networks of Three Bacterial Pathogens: Bacillus anthracis, Francisella tularensis, and Yersinia pestis

Bacillus anthracis, Francisella tularensis, and Yersinia pestis are pathogenic bacteria that are extremely infectious to humans [28, 67, 143] and have been listed as priority agents for possible use as biological weapons. However, the interactions between these bacterial proteins and the host remain poorly characterized, leading to an incomplete understanding of their pathogenesis and possible mechanisms of immune evasion. In this chapter, we discuss the experimental identification, using a high-throughput yeast two-hybrid assay, and the computational analysis of the human-pathogen protein interactions of each of these three pathogens. From more than 250,000 screens performed, we identify 3,073 human-B. anthracis, 1,383 human-F. tularensis, and 4,059 human-Y. pestis protein-protein interactions (PPIs); these interactions involve 304 B. anthracis, 52 F. tularensis, and 330 Y. pestis proteins that are uncharacterized. Computational analysis of the human PPI network features of human proteins interacting with these pathogens demonstrates that B. anthracis, F. tularensis, and Y. pestis proteins preferentially interact with human proteins that are hubs and bottlenecks in the human PPI network. In addition, we compute modules of human-pathogen PPIs that are conserved amongst the three networks. Such conserved modules reveal commonalities between how the different pathogens preferentially manipulate crucial host pathways involved in inflammation, immunity, and apoptosis. These data represent the first extensive protein interaction networks constructed for bacterial pathogens and their human hosts. These data will provide valuable insights into the rational design of novel, broad-spectrum vaccines and immunotherapeutics for infectious disease prevention and biodefense.

Figure 4.1: Overview of the experimental pipeline used to identify and study the human-bacteria protein-protein interactions. The stages shown are: fragmentation of bacterial genomic DNA (B. anthracis, F. tularensis, or Y. pestis) by sonication and CviJI restriction enzyme; recombination into the ORF selection plasmid; liquid mating with the human activation library; sequencing of positive clones; construction of the human-pathogen networks; identification of conserved networks; and human hub and bottleneck network analysis.

4.1 Methods

4.1.1 Experimental methods

We use a random yeast two-hybrid approach to identify physical interactions between human proteins and pathogen proteins. See Figure 4.1 for an overview of the experimental and analytical processes used in this study.

Vectors and strains

The two-hybrid vectors that we use for the random two-hybrid process are based on the Saccharomyces cerevisiae Gal4p DNA-binding domain (amino acids 1 to 147 for DBD constructs) and transcriptional activation domain (amino acids 768 to 881 for activation domain libraries). Both vectors have elements suitable for growth in both bacterial and yeast cells. We use two DNA-binding domain (DBD) cloning strategies that differ in how open reading frames (ORFs) are selected. In the first strategy, the DBD fusion vector pOBD.109 has a marker for selection of tryptophan prototrophy (TRP1) and kanamycin resistance. In the second strategy, the DBD library is first cloned into the vector pOBD.111, where ORFs are selected using MET2. We then PCR amplify all ORFs and clone them into the fusion vector Super B DBD. The activation domain (AD) fusion vector pGAD.PN2 has selection for leucine prototrophy (LEU2) and ampicillin resistance (ampR). In both vectors, expression of the fusion proteins is constitutively driven by the ADH1 promoter. Both vectors also contain centromeric sequences that serve to stably maintain the plasmids and keep the copy number to one or two copies per cell.

For the random two-hybrid experiments, we use a proprietary DNA-binding domain vector that permits the selection of inserts containing open reading frames (pOBD.111). This selection is achieved by inserting a MET2 selectable marker in-frame and downstream of the Gal4p DNA-binding domain and the cloning site. In the absence of selection for an in-frame ORF, the majority of inserts will be from non-coding regions or will be out of frame, and therefore of no utility in a two-hybrid assay. Using this ORF selection strategy, greater than 80% of the cloned inserts in these vectors contain open reading frames after nutritional selection. The DNA-binding domain vectors we use, pOBD.111 and pOBD.109, are slightly modified to facilitate the cloning of bacterial genomic DNA fragments that have had linkers added to their ends.

The haploid yeast strain used to express the DNA-binding domain fusions, PNY200, has the following genotype: MATα trp1-901 leu2-3,112 ura3-52 his3-200 ade2 gal4 gal80. The haploid yeast strain used to express the activation domain fusions, PJ69-4A1, has the following genotype: MATa trp1-901 leu2-3,112 ura3-52 his3-200 ade2 gal4 gal80 GAL2-ADE2 LYS2::GAL1-HIS3 met2::GAL7-lacZ. The two yeast strains are derived from the same parent cell line and display high mating efficiencies. Both allow for the introduction and selection of vectors carrying the yeast selectable markers TRP1, LEU2, and URA3. The activation domain strain contains three different Gal4p-responsive reporter genes: GAL2-ADE2 and GAL1-HIS3, which are assayed by selection on yeast synthetic media lacking adenine or histidine, respectively, and GAL7-lacZ, which can be monitored using colorimetric or luminescent assays for beta-galactosidase activity. The HIS3 reporter exhibits a low level of background His3p expression that can be counteracted by use of 3 mM 3-amino-1,2,4-triazole, a competitive inhibitor of the His3p protein. These reporters are unrelated except for the small GAL4 binding sites in their promoters. Because the promoters are so distinct, it is very unlikely that all three genes would be spuriously activated, which reduces the likelihood of false positives.

Generation of DNA-binding domain libraries

Fragments of Bacillus anthracis, Francisella tularensis, and Yersinia pestis genomic DNA are cloned into DNA-binding domain vectors pOBD.111 and pOBD.109 to create libraries for two-hybrid analysis. The genomic DNA was obtained from the laboratories of Dr. Kenneth Bradley (University of California, Los Angeles), Dr. Martha Furie (Stony Brook University), and Dr. James Bliska (Stony Brook University), respectively. Bacterial genomic DNA insert preparation involves mechanical (sonication) and enzymatic (CviJI) shearing to produce random fragments with an average size of 500 bp. We blunt single-stranded overhangs to recover fragments of the desired size. We then ligate purified fragments to linkers and co-transform them into bacterial cells with an equimolar amount of linearized vector. We then transform the entire ligation and plate onto selection plates for amplification. We pool colonies and isolate plasmid DNA for transformation into yeast.

Preparation of DNA-binding domain fusions

To screen each DBD library randomly, we plate an aliquot of yeast carrying the DNA-binding domain library on yeast synthetic media lacking tryptophan at a density that allows the selection of individual yeast colonies. After a three- to four-day incubation, we pick individual yeast clones into a 96-well plate containing yeast rich media (YPD). We then incubate the plate at 30 °C for one to two days to permit the growth of a sufficient quantity of DNA-binding domain yeast.

Random yeast two-hybrid screens

We generate DNA-binding domain libraries in a haploid MATα strain and the human spleen activation domain library in a MATa strain. Each haploid yeast culture containing a single DNA-binding domain fusion is mated with the activation domain library to generate diploid yeast cells that express both the DNA-binding and activation domain fusions. In a departure from the directed approach, we use a liquid-format mating strategy in a 96-well plate (as opposed to mating on filters or agar), thus allowing for the generation of greater than 500,000 diploids (and, therefore, protein combinations). Two-hybrid positives are selected on yeast minimal media lacking tryptophan and leucine (to select for mating events), and lacking histidine and adenine (to select for activation of the two-hybrid reporter genes).

The goal is to analyze the vast majority of B. anthracis, F. tularensis, and Y. pestis proteins as DNA-binding domain fusions. The DNA-binding domain libraries contain fragment sizes selected to be larger than 300 bp, with an average insert size of 500 bp (167 amino acids). We choose the 300 bp minimum because many recognizable protein domains are in this size range; in addition, this size of fragment works well in yeast two-hybrid assays.

We generate comprehensive protein interaction maps by performing a ten-fold coverage of the coding capacity of each of the pathogen genomes. The number of screens is calculated by dividing the total genomic sequence of the pathogen by the average fragment size in the DNA-binding domain library (500 bp) and multiplying by ten (fold coverage).
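The screen-count calculation described above can be written as a short worked example. The function name is hypothetical, and the 5,000,000 bp genome length is a round illustrative value rather than the size of any of the three genomes.

```python
import math


def screens_for_coverage(genome_bp, avg_fragment_bp=500, fold_coverage=10):
    """Number of screens = genome length / average fragment size * fold coverage.

    Rounded up so that the requested coverage is always reached.
    """
    return math.ceil(genome_bp * fold_coverage / avg_fragment_bp)


# Illustrative genome length only (an assumption, not a measured value).
example_screens = screens_for_coverage(5_000_000)
```

With these illustrative inputs, 5,000,000 / 500 gives 10,000 fragment-equivalents, and ten-fold coverage gives 100,000 screens.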

Analysis of positive screens

We incubate the yeast selection plates for ten days before analysis. We experience three different outcomes: 1) some plates exhibit no growth of yeast colonies and are discarded without further analysis; 2) some plates exhibit growth of a very large number of colonies (from hundreds of yeast colonies to a lawn of yeast); 3) the remaining plates contain a modest number of colonies, from one to a few hundred. In the first scenario, where no colonies are returned, we assume that there are no detectable interactors for those protein fragments. In the second case, where a very large number of colonies are found, it is likely that the DNA-binding domain fusions possess inherent self-activation ability and are not worthy of further investigation, as they do not represent protein interaction pairs. After analyzing many thousands of searches, it is our experience that DNA-binding domain fusions yielding in excess of 100 colonies per search are likely self-activators. Typically, our ORF-selected DNA-binding domain libraries contain two to five percent self-activating baits, in agreement with the frequencies observed for random fragments of Escherichia coli and bacteriophage T7.5 [16, 152].
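The three plate outcomes can be expressed as a simple triage rule. This is only a sketch: the function name and category labels are hypothetical, and the threshold of 100 colonies is taken from the rule of thumb stated above.

```python
def classify_plate(colony_count, self_activation_threshold=100):
    """Triage a selection plate by its colony count.

    Labels are illustrative; the default threshold follows the text's rule
    that fusions yielding more than 100 colonies are likely self-activators.
    """
    if colony_count < 0:
        raise ValueError("colony_count must be non-negative")
    if colony_count == 0:
        return "no_detectable_interactors"  # discarded without analysis
    if colony_count > self_activation_threshold:
        return "likely_self_activator"      # not pursued further
    return "candidate_positive"             # colonies picked for sequencing


plate_counts = [0, 3, 100, 250]
plate_labels = [classify_plate(count) for count in plate_counts]
```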

We select colonies for further analysis and transfer them to fresh media. We amplify both the DNA-binding and activation domain inserts by PCR and sequence the resulting products using dye-primer chemistry on capillary instruments. We use the resulting sequence information to identify the interacting protein fragments.

Filtering positive interactions

We retain interactions for positive colonies in which the insert is in the correct orientation, contains at least one but no more than two annotated genes, and does not contain multiple genomic fragments that have been ligated together.

4.1.2 Computational methods

Notation

We represent each experimentally derived human-bacterium network as a bipartite graph B = (H, P, I), where H is the set of human proteins, P is the set of proteins in the bacterium, and I is a set of edges (interactions), each of which connects one protein in H to a protein in P. Further, we represent the set of known intra-species (human) protein-protein interactions as an undirected graph G = (V, E), where V is the set of nodes (human proteins) and E is the set of edges (interactions). We now describe in detail the tests we use to analyze each of the human-pathogen networks.

Analysis of degree in the human PPI network

The degree of a protein in a graph is the number of interactions in which it participates. Proteins involved in many interactions are often referred to as hub proteins. We plot the degree distributions for six sets of human proteins: (i) the set of all human proteins in the human PPI network, (ii)–(iv) three sets of human proteins contained within each of the human-bacteria networks, (v) the set of human proteins found to interact with multiple pathogens, and (vi) the set of human proteins found to interact with all three pathogens (B. anthracis, F. tularensis, and Y. pestis). A bias towards high-degree proteins in the last five distributions would suggest that B. anthracis, F. tularensis, and Y. pestis have evolved to interact with higher degree proteins in the human PPI network.
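As a minimal sketch of the degree analysis, the following code computes node degrees from an edge list and the cumulative fraction of a protein set at or above each degree value, which is the quantity plotted in a cumulative degree distribution. The function names and the toy network are our own illustrative assumptions.

```python
from collections import defaultdict

def degrees(edges):
    """Degree of every node in an undirected graph given as (u, v) pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-interactions
            neighbors[u].add(v)
            neighbors[v].add(u)
    return {node: len(nbrs) for node, nbrs in neighbors.items()}

def cumulative_fraction(values, proteins):
    """Map each observed value x to the fraction of `proteins` with value >= x."""
    observed = sorted(values[p] for p in proteins if p in values)
    n = len(observed)
    result = {}
    for idx, x in enumerate(observed):
        # At the first occurrence of x, exactly n - idx values are >= x.
        result.setdefault(x, (n - idx) / n)
    return result

# Toy human PPI network (hypothetical protein names).
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
deg = degrees(toy_edges)
all_proteins = cumulative_fraction(deg, deg.keys())
interactors = cumulative_fraction(deg, {"A", "B"})  # e.g., proteins targeted by a pathogen
```

Comparing `interactors` with `all_proteins` at each degree value corresponds to comparing one of the last five curves with the curve for all human proteins.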

Analysis of betweenness centrality in the human PPI network

The degree of a protein captures only its local connectivity. Betweenness centrality (BC) measures capture both global and local features of a protein’s importance in a network [74]. A protein with high betweenness centrality is characteristic of a bottleneck in an interaction network (i.e., there are many paths which pass through this protein) [243]. The betweenness centrality for a protein v ∈ V is defined as the fraction of shortest paths in G between all protein pairs (u, w) that pass through the protein v. Given u, v, w ∈ V, let $\sigma_{uw}$ denote the number of shortest paths between proteins u and w. Note that there may be multiple equally long paths between u and w that are shorter than any other path between u and w. Further, let $\sigma_{uw}(v)$ denote the number of these shortest paths that pass through v. Then the betweenness centrality of v is

\[
BC(v) = \sum_{\substack{u, w \in V \\ u, w \neq v}} \frac{\sigma_{uw}(v)}{\sigma_{uw}}
\]

To compute the betweenness centrality for each protein in G, we use the algorithm devised by Brandes [32]. This algorithm runs in time proportional to the product of the number of nodes in G and the number of edges in G. We plot distributions for the same six sets as in the degree analysis. Again, if the distribution for the last five sets is biased to higher values of centrality than the distribution for the first set, we could hypothesize that B. anthracis, F. tularensis, and Y. pestis have evolved to interact with proteins with higher betweenness centrality in the human PPI network.
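For readers unfamiliar with Brandes' algorithm, the following is a compact sketch for unweighted, undirected graphs. It runs one breadth-first search per source node and then accumulates path dependencies in reverse BFS order. We assume here that the sum in the definition runs over unordered pairs (u, w), so the accumulated totals are halved; the function name and input format (a symmetric adjacency dictionary) are our own choices.

```python
from collections import deque

def betweenness(adj):
    """Brandes-style betweenness for an unweighted, undirected graph.

    adj: dict mapping every node to an iterable of its neighbours (symmetric).
    Returns the unnormalised sum over unordered pairs (u, w), u, w != v.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward phase: BFS from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward phase: accumulate dependencies from the farthest nodes inward.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: total / 2.0 for v, total in bc.items()}

# Toy path graph A - B - C - D (hypothetical names).
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
bc_path = betweenness(path)
```

Each BFS costs time proportional to the number of edges, and there is one BFS per node, which gives the running time quoted above.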


Gene set enrichment analysis

We use Gene Set Enrichment Analysis (GSEA) to determine if the human proteins interacting with B. anthracis, F. tularensis, and/or Y. pestis have significantly higher degree and betweenness centrality than randomly picked proteins in G [219]. Let L be a ranked list of proteins in G, where we rank the human proteins by degree or by betweenness centrality. Given L and a predefined set S of proteins of interest (e.g., those interacting with Francisella), GSEA determines whether the proteins contained in S are randomly distributed throughout L or concentrated at the top. We compute the enrichment score es(S, L) as follows. In the ranked list L, let $l_i$ be the value (of degree or centrality) at index i, 1 ≤ i ≤ |L|. We abuse notation and say that an index i is an element of S if the protein whose rank is i belongs to S. We compute $m = \sum_{i \in L} l_i$, the total of all the values in L. Next, for each index i in L, we compute two values:

\[
p_{hit}(S, i) = \sum_{j \in S,\; j \le i} \frac{l_j}{m}
\qquad
p_{miss}(S, i) = \sum_{j \notin S,\; j \le i} \frac{1}{|L| - |S|}
\]

Thus, $p_{hit}(S, i)$ is the share of the total value m contributed by proteins in S with index at most i in L, and $p_{miss}(S, i)$ is the fraction of proteins not in S that have index at most i in L. We handle multiple ranks with identical values by computing these two values only at the largest rank for each unique value in L. Finally, we define the enrichment score as the largest positive value of $p_{hit}(S, i) - p_{miss}(S, i)$, i.e.,

\[
es(S, L) = \max_{1 \le i \le |L|} \Big( \max\big( p_{hit}(S, i) - p_{miss}(S, i),\; 0 \big) \Big)
\]

A large positive value of es(S, L) identifies a set of human proteins interacting with B. anthracis, F. tularensis, and/or Y. pestis proteins that have high degree or high betweenness centrality. To compute p-values for an observed enrichment score s, we generate a null distribution of scores by repeatedly selecting |S| random nodes in L and computing the score for each random subset of nodes. We repeat this process 1,000,000 times and estimate the p-value for s as the fraction of random sets whose score is at least as large as s.
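The enrichment score and its permutation p-value can be sketched as follows. This is an illustration of the procedure as described in the text, not the code used to produce our results: the normaliser m is the sum over all of L as stated above, ties are evaluated only at the last rank of each run of equal values, and S is assumed to be a subset of L. The function names, the toy values, and the small default number of permutations are our own assumptions.

```python
import random

def enrichment_score(ranked, values, S):
    """es(S, L) for `ranked` (proteins in decreasing order of value) and a set S."""
    m = sum(values[p] for p in ranked)          # total of all values in L
    n_miss = len(ranked) - len(S)               # assumes S is a subset of L
    hit = miss = best = 0.0
    for i, p in enumerate(ranked):
        if p in S:
            hit += values[p] / m
        else:
            miss += 1.0 / n_miss
        # Evaluate only at the largest rank of each run of identical values.
        last_of_tie = i + 1 == len(ranked) or values[ranked[i + 1]] != values[p]
        if last_of_tie:
            best = max(best, hit - miss)
    return best

def permutation_p_value(ranked, values, S, n_perm=1000, seed=0):
    """Fraction of random |S|-subsets of L scoring at least as high as S."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked, values, S)
    at_least = sum(
        enrichment_score(ranked, values, set(rng.sample(ranked, len(S)))) >= observed
        for _ in range(n_perm)
    )
    return observed, at_least / n_perm

# Toy degrees for six hypothetical proteins.
toy_values = {"a": 10, "b": 8, "c": 5, "d": 5, "e": 2, "f": 1}
toy_ranked = sorted(toy_values, key=lambda p: (-toy_values[p], p))
score, p_value = permutation_p_value(toy_ranked, toy_values, {"a", "b"})
```

With only `n_perm` random subsets the smallest non-zero p-value that can be estimated is 1/`n_perm`, which is why a large number of repetitions is needed to resolve very small p-values.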

Identifying paralogous and orthologous protein pairs

In preparation for computing conserved protein interaction modules, we computed orthologous pairs of proteins in every pair of pathogens. We used Inparanoid [190] with default parameters to define orthologous pairs of proteins. In addition, we used OrthoMCL [141] with default parameters to identify paralogous pairs of human proteins. For the sake of convenience, we considered a human protein appearing in one human-pathogen PPI network to be paralogous to a copy of the same protein appearing in another human-pathogen network.

Conserved human-pathogen PPI modules

Given a pair of human-pathogen PPI networks B1 and B2, let H be the set of edges corresponding to the orthologous and paralogous pairs of proteins between B1 and B2, as computed above. We consider all edges to have weight 1. We define a Conserved Protein Interaction Module (CPIM) to be a triple (T1, T2, O) where T1 and T2 are connected subgraphs of B1 and B2, respectively, and O ⊆ H such that (a, b) ∈ O if and only if a is a node in T1 and b is a node in T2. Thus, O is the subgraph of H induced by the nodes in T1 and T2. We use two measures of quality for a CPIM: conservation score and interaction score.

We define the interaction score of a CPIM (T1, T2, O) to be simply the total number of host-pathogen PPIs in T1 and T2 and denote this score by q(T1, T2, O). The conservation score of a CPIM (T1, T2, O) measures the amount of evolutionary similarity (at the amino acid level) between the human-pathogen interaction networks T1 and T2. Let P1 (respectively, P2) be the set of nodes (both human and pathogen) in T1 (respectively, T2). We define the conservation score of the CPIM (T1, T2, O) as

\[
\phi(T_1, T_2, O) = \frac{|O|}{|P_1| + |P_2|},
\]

i.e., the total number of orthologous or paralogous pairs of nodes in the CPIM divided by the total number of nodes in the CPIM. The larger this score, the more evolutionarily conserved we consider T1 and T2 to be, since there are fewer proteins without orthologs or paralogs in the CPIM. Note that if we are given T1 and T2, we can maximize this score by making O the subgraph of H induced by P1 and P2.
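A small sketch may help fix the two scores. Here P1 and P2 are taken to be the node sets of the two sides of the CPIM, in line with the "total number of nodes in the CPIM" reading above, and O is built as the subset of H induced by those nodes. All names and the toy module are hypothetical.

```python
def induced_orthology_edges(p1_nodes, p2_nodes, H):
    """O: the pairs (a, b) of H with a on the first side and b on the second side."""
    return {(a, b) for a, b in H if a in p1_nodes and b in p2_nodes}

def conservation_score(p1_nodes, p2_nodes, O):
    """phi = |O| / (|P1| + |P2|)."""
    return len(O) / (len(p1_nodes) + len(p2_nodes))

def interaction_score(t1_edges, t2_edges):
    """q = total number of host-pathogen PPIs on both sides of the CPIM."""
    return len(t1_edges) + len(t2_edges)

# Toy CPIM: human proteins h*, pathogen proteins pA* / pB* (hypothetical names).
P1 = {"h1", "h2", "pA1"}
P2 = {"h1", "h3", "pB1"}
H = {("h1", "h1"), ("pA1", "pB1"), ("h2", "h9")}   # ("h2", "h9") falls outside P2
O = induced_orthology_edges(P1, P2, H)
phi = conservation_score(P1, P2, O)
q = interaction_score({("h1", "pA1"), ("h2", "pA1")}, {("h1", "pB1"), ("h3", "pB1")})
```

In this toy module two of the six nodes on each side are matched across networks, so phi is 2/6; adding unmatched nodes lowers phi, while adding a matched pair raises it.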

4.1.3 The GraphHopper Algorithm

We use the GraphHopper [191] algorithm to compute CPIMs. Given two human-pathogen PPI networks B1 and B2, GraphHopper finds CPIMs by “hopping” from one network to another using orthology and paralogy relationships. Note that GraphHopper does not consider PPIs between human proteins. Solely for the sake of completeness, we describe a simplified version of the algorithm here; we stress that the GraphHopper algorithm is not a contribution of this dissertation. GraphHopper attempts to find CPIMs with high conservation score and low interaction score. At a high level, the algorithm starts with a connected basis CPIM that contains four nodes and edges. Iteratively, the algorithm “hops” from one PPI network to another. In each iteration, GraphHopper expands the CPIM to increase the conservation score, while attempting to keep the interaction score as small as possible. We now provide details about the algorithm. Our inputs are two human-pathogen protein interaction networks B1 = (V1, E1) and B2 = (V2, E2) and a set H of orthologous or paralogous protein pairs.

Figure 4.2: An illustration of how GraphHopper expands a CPIM in iteration k. (a) A CPIM at the end of iteration k − 1. (b) In iteration k, GraphHopper keeps the network on the left side of the CPIM fixed and expands the network on the right side of the CPIM. The two nodes marked by arrows belong to the set P computed in Step (i) of the algorithm. The node v′ found in Step (iii) is the lower of these two nodes. In Steps (iv) and (v), GraphHopper adds the thick magenta interactions and orthology edges to the red network in the CPIM. (c) The CPIM at the end of iteration k.

4.1.4 Computing basis CPIMs.

We start by constructing a basis set of CPIMs in which each CPIM (T1, T2, O) has the following properties:

(i) O contains two edges (a, a′) ∈ H and (b, b′) ∈ H;

(ii) a and b are connected by an intermediate protein in B1; and

(iii) a′ and b′ are connected by an intermediate protein in B2.

Thus, each basis CPIM consists of four host-pathogen PPIs (two each in T1 and in T2) and two orthology or paralogy edges. The basis set consists of all such CPIMs.
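The three properties above translate directly into an enumeration over pairs of orthology or paralogy edges. The following is a minimal sketch under our own assumptions: each network is given as a symmetric adjacency dictionary of sets, the two edges of O are required to have distinct endpoints on both sides (implied by the count of four PPIs), and one basis CPIM is emitted per choice of intermediate proteins. It is not the GraphHopper implementation.

```python
from itertools import combinations

def basis_cpims(adj1, adj2, H):
    """Enumerate basis CPIMs (T1 edges, T2 edges, O) from two networks and H."""
    basis = []
    for (a, a2), (b, b2) in combinations(sorted(H), 2):
        if a == b or a2 == b2:
            continue  # the two orthology/paralogy edges must not share an endpoint
        # Proteins adjacent to both a and b in B1, and to both a' and b' in B2.
        mids1 = adj1.get(a, set()) & adj1.get(b, set())
        mids2 = adj2.get(a2, set()) & adj2.get(b2, set())
        for x in sorted(mids1):
            for y in sorted(mids2):
                t1 = {(a, x), (x, b)}        # two PPIs in B1
                t2 = {(a2, y), (y, b2)}      # two PPIs in B2
                basis.append((t1, t2, {(a, a2), (b, b2)}))
    return basis

# Toy networks: two pathogen proteins sharing one human interactor in each network.
adj1 = {"pA1": {"h1"}, "pA2": {"h1"}, "h1": {"pA1", "pA2"}}
adj2 = {"pB1": {"h1"}, "pB2": {"h1"}, "h1": {"pB1", "pB2"}}
H = {("pA1", "pB1"), ("pA2", "pB2")}
toy_basis = basis_cpims(adj1, adj2, H)
```

On the toy input a single basis CPIM is produced, with two PPIs on each side and two orthology edges, matching the description above.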

4.1.5 Expanding a basis CPIM.

GraphHopper processes each CPIM in the basis set using the following iterative algorithm. Let $(T_1^1, T_2^1, O^1)$ be a basis CPIM. In iteration k > 1, we construct a CPIM $(T_1^k, T_2^k, O^k)$ such that $(T_1^{k-1}, T_2^{k-1}, O^{k-1})$ is a subgraph of $(T_1^k, T_2^k, O^k)$ and $\phi(T_1^k, T_2^k, O^k) > \phi(T_1^{k-1}, T_2^{k-1}, O^{k-1})$, i.e., the new CPIM has a higher conservation score. Further, we attempt to keep $q(T_1^k, T_2^k, O^k) - q(T_1^{k-1}, T_2^{k-1}, O^{k-1})$ as small as possible, i.e., the new CPIM has as few PPIs added to it as possible. We keep either $T_1^{k-1}$ or $T_2^{k-1}$ fixed and “expand” the other graph. Without loss of generality, we assume that $T_1^k = T_1^{k-1}$ and $T_2^{k-1}$ is a subgraph of $T_2^k$ in the following discussion. We construct $(T_1^k, T_2^k, O^k)$ using the following steps:

(i) We identify a set $P \subseteq V_2$ of nodes such that each node v ∈ P is not a node in $T_2^{k-1}$ and is connected by an edge in H to at least one node in $T_1^k$.

(ii) For each node v ∈ P, we use breadth-first search to compute the shortest path $\pi_v$ in $B_2$ that connects v to $T_2^{k-1}$, i.e., for each node $u \in T_2^{k-1}$, we compute the shortest path between u and v in $B_2$, and set $\pi_v$ to be the shortest of these paths.

(iii) We find the node v′ in P such that $\pi_{v'}$ is the lightest among all paths computed in the previous step.

(iv) We set $T_2^k$ to be the union of $T_2^{k-1}$ and $\pi_{v'}$.

(v) We set $O^k$ to be the union of $O^{k-1}$ and the set of edges in H incident on v′ and a node in $T_1^k$.

(vi) We compute $\phi(T_1^k, T_2^k, O^k)$. If $\phi(T_1^k, T_2^k, O^k) > \phi(T_1^{k-1}, T_2^{k-1}, O^{k-1})$, we go to Step (i) and expand $(T_1^k, T_2^k, O^k)$ while keeping $T_2^k$ fixed. Otherwise, we stop expanding this CPIM and proceed to the next basis CPIM.

The rationale for these steps is as follows. To expand the CPIM $(T_1^{k-1}, T_2^{k-1}, O^{k-1})$ after setting $T_1^k = T_1^{k-1}$, we first identify the set P of nodes in $B_2$ that do not belong to $T_2^{k-1}$ but are orthologs of nodes in $T_1^k$ (Step (i)). Each node in P is a candidate that we can add to $T_2^{k-1}$ in order to construct $T_2^k$. However, such a node v ∈ P may not be adjacent to any node in $T_2^{k-1}$. Since our goal is to keep $q(T_1^k, T_2^k, O^k) - q(T_1^{k-1}, T_2^{k-1}, O^{k-1})$ as small as possible, we would like to connect v to $T_2^{k-1}$ using the fewest edges in $B_2$. A natural candidate for this set of edges is the shortest path $\pi_v$ connecting v to $T_2^{k-1}$, where this minimum is taken over the set of shortest paths connecting v to each node in $T_2^{k-1}$. Therefore, for each node v in P, we compute the shortest path $\pi_v$ by which we can connect v to $T_2^{k-1}$ using only edges in $B_2$ (Step (ii)). In Steps (iii) and (iv), we add to $T_2^{k-1}$ the path $\pi_{v'}$ that is shortest among all the paths computed, i.e., $v' = \arg\min_{v \in P} |\pi_v|$. After computing $T_2^k$, we set $O^k$ to be the subgraph of H induced by the nodes in $T_1^k$ and $T_2^k$ by adding the edges in H that are incident on v′ and any node in $T_1^k$ (Step (v)); by construction, no node in $\pi_{v'}$ other than v′ is connected by an edge in H to a node in $T_1^k$. This step completes the construction of $(T_1^k, T_2^k, O^k)$. Finally, in Step (vi), we continue expanding $(T_1^k, T_2^k, O^k)$ if its conservation score is greater than $\phi(T_1^{k-1}, T_2^{k-1}, O^{k-1})$. Otherwise, we stop the iteration and move on to the next basis CPIM. By induction, the graphs $T_1^k$, $T_2^k$ and $T_1^k \cup T_2^k \cup O^k$ are connected. Note that $q(T_1^k, T_2^k, O^k)$ implicitly plays a role in the expansion: by choosing to add the shortest path $\pi_{v'}$ to $T_2^k$, we are attempting to minimize $q(T_1^k, T_2^k, O^k) - q(T_1^{k-1}, T_2^{k-1}, O^{k-1})$.
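Steps (i) through (v) of one expansion can be sketched as follows. This is an illustration under our own assumptions rather than the GraphHopper code: the current right-hand subgraph is tracked as a node set, the network B2 is a symmetric adjacency dictionary of sets, and a single multi-source breadth-first search started from all nodes of the current subgraph replaces the per-candidate searches of Step (ii). The first candidate reached by that search is a closest one, so it yields the same shortest path length. Step (vi), the comparison of conservation scores, is left to the caller.

```python
from collections import deque

def expand_once(adj2, t1_nodes, t2_nodes, H, O):
    """One expansion of the right-hand side; returns (new_t2_nodes, path, new_O) or None."""
    # Step (i): candidates outside T2 that have an edge in H to a node of T1.
    P = {b for a, b in H if a in t1_nodes and b in adj2 and b not in t2_nodes}
    # Steps (ii)-(iii): multi-source BFS from T2; the first candidate dequeued is closest.
    parent = {u: None for u in t2_nodes}
    queue = deque(sorted(t2_nodes))
    v_best = None
    while queue:
        v = queue.popleft()
        if v in P:
            v_best = v
            break
        for w in sorted(adj2.get(v, ())):
            if w not in parent:
                parent[w] = v
                queue.append(w)
    if v_best is None:
        return None  # no candidate is reachable from T2
    # Recover the path from v' back to the node of T2 it attaches to.
    path = [v_best]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    # Step (iv): add the path to T2.  Step (v): add edges of H incident on v'.
    new_t2 = set(t2_nodes) | set(path)
    new_O = set(O) | {(a, b) for a, b in H if b == v_best and a in t1_nodes}
    return new_t2, path, new_O

# Toy right-hand network (hypothetical names): pB3 is two hops away from the module.
adj2 = {"h1": {"pB1", "pB2"}, "pB1": {"h1"}, "pB2": {"h1", "h2"},
        "h2": {"pB2", "pB3"}, "pB3": {"h2"}}
T1 = {"h1", "pA1", "pA2", "pA3"}
T2 = {"h1", "pB1", "pB2"}
H = {("pA1", "pB1"), ("pA2", "pB2"), ("pA3", "pB3")}
O = {("pA1", "pB1"), ("pA2", "pB2")}
new_T2, path, new_O = expand_once(adj2, T1, T2, H, O)
```

In the toy example the only candidate is pB3, which is attached through the intermediate human protein h2; the caller would then recompute the conservation score and decide whether to keep expanding.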


4.1.6 Assessing the statistical significance of a CPIM.

We compute the statistical significance of a CPIM using standard methods [204]. We compute two random PPI networks with the same degree distribution as B1 and B2 and a random network connecting nodes in B1 to nodes in B2 with the same degree distribution as H. We compute a histogram of the conservation scores of all CPIMs that GraphHopper finds in these networks. We amalgamate histograms over 100 random inputs and estimate the p-value of a CPIM (T1, T2, O) as the fraction of CPIMs in random networks whose conservation score is at least as large as φ(T1, T2, O). We retain CPIMs that have p-value at most 0.05.
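The text does not spell out how the degree-preserving random networks are generated, so the sketch below uses repeated edge swaps on a bipartite edge list, which is one common way to randomize a network while keeping every node's degree fixed; this choice is our assumption, as are the function names. The second function is the empirical p-value described above.

```python
import random

def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Randomize a bipartite edge set by swapping endpoints of random edge pairs."""
    rng = random.Random(seed)
    edge_list = sorted(set(edges))
    present = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (h1, p1), (h2, p2) = edge_list[i], edge_list[j]
        # Skip swaps that would create an edge that already exists.
        if (h1, p2) in present or (h2, p1) in present:
            continue
        present -= {(h1, p1), (h2, p2)}
        present |= {(h1, p2), (h2, p1)}
        edge_list[i], edge_list[j] = (h1, p2), (h2, p1)
    return present

def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score."""
    return sum(s >= observed for s in null_scores) / len(null_scores)

# Toy host-pathogen edge list (hypothetical names).
toy_edges = [("h1", "p1"), ("h1", "p2"), ("h2", "p2"), ("h2", "p3"), ("h3", "p3")]
shuffled = degree_preserving_shuffle(toy_edges, n_swaps=50, seed=1)
```

Each accepted swap replaces (h1, p1) and (h2, p2) with (h1, p2) and (h2, p1), so every protein keeps exactly the number of interactions it had in the original network.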

4.1.7 CPIM Functional Enrichment

For each CPIM we compute enriched Gene Ontology (GO) [10] functions for five sets of proteins: the set of human proteins interacting with pathogen 1, the set of human proteins interacting with pathogen 2, the union of human proteins in the CPIM, and the two sets of pathogen proteins in the CPIM. Considering a set of proteins S, e.g., those interacting with pathogen 1, we compute enriched functions as follows. For every function f in GO, let $s_f$ be the count of proteins in S annotated with f. Let $u_f$ be the count of proteins in the universe U annotated with f. For human proteins, we define the universe to be the set of all human proteins previously identified in the activation library; for pathogens, the universe is the set of pathogen proteins found to interact with at least one human protein. With these two counts we compute the p-value of f as

\[
p_f = \sum_{k = s_f}^{\min(|S|,\, u_f)} \frac{\binom{u_f}{k} \binom{|U| - u_f}{|S| - k}}{\binom{|U|}{|S|}}
\]

We retain only functions for which $p_f \le 0.05$ after accounting for multiple hypothesis testing using the method of Benjamini and Hochberg [23]. Since functions in GO are specified at multiple levels of detail, the set of enriched functions may contain closely related functions. We use the following criteria to collapse the enriched functions to the most specific and most enriched. From the set of all enriched functions, we remove a function f if there is another function g such that

(i) $p_g < p_f$, i.e., g is more statistically significant than f, and

(ii) g is either an ancestor or a descendant of f.

Thus, we retain a function g precisely when g is more significant than all its ancestors and descendants in GO.
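The enrichment test, the multiple-testing correction, and the collapsing rule can be sketched with the standard library alone. The p-value is the upper tail of the hypergeometric distribution for drawing |S| proteins from a universe of |U| proteins, of which a given number are annotated with f. The function names, the dictionary-based inputs, and the toy ancestor/descendant relation are our own assumptions; a real analysis would take that relation from the GO graph.

```python
from math import comb

def hypergeom_p(s_f, u_f, n_S, n_U):
    """P(at least s_f annotated proteins among n_S drawn from n_U, u_f annotated)."""
    upper = min(n_S, u_f)
    tail = sum(comb(u_f, k) * comb(n_U - u_f, n_S - k) for k in range(s_f, upper + 1))
    return tail / comb(n_U, n_S)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values for a dict {function: p}."""
    items = sorted(pvals.items(), key=lambda kv: kv[1])
    n = len(items)
    adjusted = {}
    running_min = 1.0
    for rank in range(n, 0, -1):            # walk from the largest p-value down
        f, p = items[rank - 1]
        running_min = min(running_min, p * n / rank)
        adjusted[f] = running_min
    return adjusted

def collapse(pvals, related):
    """Keep f unless some ancestor or descendant g of f has a strictly smaller p-value."""
    return {
        f for f in pvals
        if not any(pvals[g] < pvals[f] for g in related.get(f, ()) if g in pvals)
    }

# Toy example with hypothetical function labels; "parent" and "child" are related in GO.
adjusted = benjamini_hochberg({"parent": 0.04, "child": 0.01, "other": 0.03})
related = {"parent": {"child"}, "child": {"parent"}}
kept = collapse(adjusted, related)
```

In the toy example the "parent" function is dropped because its related "child" function is more significant, while "other" is kept because it has no enriched relatives.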


4.1.8 Merging CPIMs

The steps described above convert each basis CPIM into an expanded CPIM with high conservation score and low interaction score. However, the expanded CPIMs may have considerable overlap. We modify the procedure used by Sharan et al. [204] to merge CPIMs. For each CPIM C, we compute all the biological functions it is enriched in and note the function $f_C$ that is most enriched (has smallest p-value) in C. Let F be the set of all such most-enriched functions. Finally, for each function l ∈ F, we compute a CPIM $C_l$ as the union of all CPIMs C for which $l = f_C$, i.e., $C_l = \bigcup_{C :\, f_C = l} C$. We report results for these CPIMs. Note that this method (i) does not require us to provide a cutoff on the overlap of two CPIMs that should be merged, (ii) allows merged CPIMs to share both proteins and interactions, and (iii) may yield disconnected CPIMs.
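The merging rule amounts to grouping CPIMs by their most enriched function and taking unions. A minimal sketch follows, assuming each CPIM is stored as a pair of node and edge sets and that enrichment results are available as a dictionary of p-values per CPIM; these data structures and names are illustrative only.

```python
from collections import defaultdict

def merge_by_top_function(cpims, enrichment):
    """Union all CPIMs that share the same most-enriched function.

    cpims: {cpim_id: (node_set, edge_set)}
    enrichment: {cpim_id: {function: p_value}}
    Returns {function: (merged_node_set, merged_edge_set)}.
    """
    merged = defaultdict(lambda: (set(), set()))
    for cid, (nodes, edges) in cpims.items():
        pvals = enrichment.get(cid)
        if not pvals:
            continue  # a CPIM with no enriched function is not merged anywhere
        top = min(pvals, key=pvals.get)      # function with the smallest p-value
        merged[top][0].update(nodes)
        merged[top][1].update(edges)
    return dict(merged)

# Toy CPIMs with hypothetical names and p-values.
toy_cpims = {
    "c1": ({"h1", "pA"}, {("h1", "pA")}),
    "c2": ({"h2", "pB"}, {("h2", "pB")}),
    "c3": ({"h3"}, set()),
}
toy_enrichment = {"c1": {"f1": 0.001, "f2": 0.01}, "c2": {"f1": 0.002}, "c3": {"f2": 0.03}}
merged = merge_by_top_function(toy_cpims, toy_enrichment)
```

Because c1 and c2 share no proteins, the merged module for f1 is disconnected, which illustrates point (iii) above.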

4.1.9 Datasets used

We gathered 78,804 PPIs between human proteins from seven databases: the Biomolecular Interaction Network Database [80], the Database of Interacting Proteins [196], the Human Protein Reference Database [159], IntAct [96], the Molecular INTeraction database [246], the Munich Information Center for Protein Sequences [86], and Reactome [117]. For some analyses, we consider a human PPI network assembled from unbiased high-throughput experiments [68, 194, 214] and a network constructed from only manually curated human PPIs [117, 159]. These networks contain 13,172 and 64,427 interactions, respectively. We also obtained functional annotations from the Gene Ontology (GO) [10]. We gathered information for virulence factors from MVirDB [251]. These data were downloaded in February 2008.


From over 240,000 screens we identify a set of 3,911, 1,942, and 5,157 PPIs for the human-B. anthracis, -F. tularensis, and -Y. pestis networks, respectively. We filter this set by removing human proteins that interact with a large number of pathogen proteins in multiple screens with other pathogens (unpublished data). Table 4.1 summarizes the final set of interaction data used in this study. In total we identify 3,073 PPIs between 1,748 human and 943 B. anthracis proteins, 1,383 PPIs between 999 human and 349 F. tularensis proteins, and 4,059 PPIs between 2,108 human and 1,218 Y. pestis proteins. Figure 4.3 displays a Venn diagram of the human proteins interacting with pathogen proteins for each of the pathogens.

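Figure 4.3: Venn diagram of human proteins interacting with pathogen proteins for each of the pathogens. Region counts: B. anthracis only, 861; F. tularensis only, 426; Y. pestis only, 1,197; B. anthracis and F. tularensis, 133; B. anthracis and Y. pestis, 472; F. tularensis and Y. pestis, 157; all three pathogens, 282.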

Table 4.1: Summary of human-pathogen interactions.

Organism        Observed PPIs   # H. sapiens proteins   # pathogen proteins
B. anthracis    3,073           1,748                   943
F. tularensis   1,383           999                     349
Y. pestis       4,059           2,108                   1,218

4.2.1 Bacterial pathogens interact with human hubs and bottlenecks

Two recent studies [36] (also, see Chapter 3) have shown that viral proteins may have evolved to preferentially interact with protein hubs (proteins with many interacting partners) and bottlenecks (proteins that lie in shortest paths between many pairs of proteins) in the human PPI network. We hypothesize that bacterial proteins may also have this property; such a phenomenon may suggest selective pressure on the pathogen proteins to control and disrupt complexes and pathways governing critical processes in the human cell. Our analysis of human proteins interacting with bacterial proteins supports this hypothesis. Figure 4.4(a) displays a log-log plot of the degree of six sets of proteins in a human PPI network collated from multiple databases [80, 86, 96, 117, 159, 196, 246]: (i) all human proteins, (ii)–(iv) the subset of proteins interacting with at least one B. anthracis, one F. tularensis, or one Y. pestis protein, (v) the subset of proteins interacting with at least two different pathogens, and (vi) the subset of proteins interacting with all three pathogens. These plots show that across almost the entire range of degrees, proteins interacting with these bacterial pathogens tend to have higher degree. The betweenness centrality results display the same trend (see Figure 4.4(b)).

We use Gene Set Enrichment Analysis (GSEA) [219] to test whether the gaps we observe in Figure 4.4(a) and Figure 4.4(b) between the curve for all human proteins and the other curves are statistically significant. GSEA is a method developed to assess the significance of the differential expression of a pre-defined gene set in two phenotypes of interest [219]. GSEA ranks all genes by a suitable measure of differential expression (e.g., the t-statistic) and uses a modified Kolmogorov-Smirnov test to assess if the genes in the given set have surprisingly high or low ranks. Since distributions of the t-statistics of differentially expressed genes have been observed to follow a power-law distribution [101], we reason that GSEA may be appropriate to test whether the human proteins interacting with the pathogens B. anthracis, F. tularensis, and Y. pestis have surprisingly high degree or betweenness centrality. GSEA yields p-values less than $1 \times 10^{-6}$ for both degree and centrality for all groups (see Tables B.6 and B.7 in the appendix), supporting the conclusions we draw from Figures 4.4(a) and 4.4(b).

Following an earlier analysis for viral-human PPIs [64] (see Chapter 3), to address the possibility that the observed patterns may be artifacts of experimental biases or errors in the human PPI network, we repeat both analyses using two subsets of the human PPI network: (i) interactions detected by small-scale experiments and (ii) interactions observed by large-scale studies. We obtain statistically significant results in both cases (see Tables B.6 and B.7 in the appendix).

4.2.2 Human proteins interacting with multiple pathogens

Next, we asked if human proteins interacting with multiple pathogens may be involved in functions related to pathogenesis. Since manipulation of apoptotic programming in the host has been linked to infection by all three pathogens [132, 172, 249], we identified human proteins that are known to be involved in apoptosis and asked which pathogen proteins they interact with (see Figure 4.5). The human protein CIAPIN1 is a cytokine-induced apoptosis inhibitor. By activating this inhibitor, F. tularensis may be able to prevent the cell from activating apoptotic pathways. Other interactions include the human STK17A, a serine/threonine-protein kinase that acts as a positive regulator of apoptosis [197]. We find that STK17A interacts with the Y. pestis yopM, an outer-membrane protein, and B. anthracis BAS1763, a putative aminopeptidase. We also identify interactions with the human DAXX protein, a death-associated protein. The interactor for DAXX is the Y. pestis outer-membrane protein y2359. In contrast to the interactions with CIAPIN1, these interactions may be involved in the promotion of apoptosis. We also identify a number of interactions with RTN4, a multifunctional protein involved in tumor suppression, inhibition of neuronal regeneration, and apoptosis [238]. The pathogen interactor of RTN4 is the Y. pestis lcrV, a virulence-associated protein.

[Figure 4.4 panels. (a) Cumulative degree distribution: "Degree Distributions For Whole Human PPI Network"; x-axis, degree of human protein in human PPI network; y-axis, fraction of proteins. (b) Cumulative betweenness centrality distribution: "Centrality Distributions For Whole Human PPI Network"; x-axis, centrality of human protein in human PPI network; y-axis, fraction of proteins. Legend for both panels: Non-pathogen interactors (9,110), BA (1,269), FT (729), YP (1,514), Multiple pathogens (828), All pathogens (241).]

Figure 4.4: Cumulative log-log plots of (a) node degrees and (b) centralities for six subsets of proteins in the human PPI network: the red curve is for the set of all proteins in the human PPI network; the green curve is for the set interacting with B. anthracis; the dark blue curve is for the set interacting with F. tularensis; the purple curve is for the set interacting with Y. pestis; the light blue curve is for the set interacting with at least two pathogens; and the orange line is for the set interacting with all three pathogens. For each set, the fraction of proteins at a particular value of degree or centrality is the number of proteins having that value or greater divided by the number of proteins in the set. Counts in parentheses represent the number of proteins in each set.

Figure 4.5: Identified interactions of human proteins involved in apoptosis. We divide the human proteins into subsets based on whether they induce or prevent apoptosis, or whether they regulate apoptosis. Proteins in the group labeled “Non-specific” do not have an annotation more specific than “Apoptosis” in the Gene Ontology [10]. For clarity this image shows only interactions involving virulence factors and “uncharacterized” proteins.

Interestingly, the in-depth analysis of this network has uncovered interactions among sets of bacterial and human proteins involved in innate immunity (i.e., TLR4 and TLR7), inflammation (IL-8RB, NF-κB and Bcl-6) and recruitment, regulatory function, maturation and activation of T cells (i.e., CXCR4, STAT3, NOTCH2, and LCK). For example, LCK is a tyrosine kinase expressed in T cells associated with the cytoplasmic tail of CD4 and CD8 co-receptors. Functionally, in addition to its role in mitochondrial apoptosis, LCK is a crucial regulator of T cell activation [227]. Of note, LCK interacts with proteins from all three pathogens, suggesting that these bacteria may have developed a conserved mechanism of impairing effector T cell responses by disrupting LCK signaling.

CXC-chemokine receptor 4 (CXCR4) is the major co-receptor for human immunodeficiency virus in CD4+ T cells and a promising new target for developing anti-HIV drugs [83]. We find that CXCR4 interacts with the yscP protein, a known virulence factor from Y. pestis. The natural ligand for CXCR4 is CXCL12 or SDF1 (stromal cell-derived factor-1), a chemokine involved in the recruitment of down-modulatory FOXP3+ regulatory T cells (Treg) into inflamed tissues [115]. In addition, STAT3 is required for expression of FOXP3 in Treg [171]. Our data demonstrate that STAT3 interacts with the Y1119 protein of Y. pestis. In turn, we show that TGF-β1, a down-modulatory cytokine produced by Treg, interacts both with Y. pestis and F. tularensis proteins. The highly virulent Schu4 strain of F. tularensis has been shown to suppress inflammation in infected mice, and this inhibition has been attributed to induction of TGF-β [29]. Similar patterns have been observed in B. anthracis and Y. pestis [88, 179]. Taken together, these interactions point towards a possible mechanism of down-modulating effector T cell responses against pathogens by targeting Treg-related proteins.

Table 4.2: Summary of the number of identified CPIMs for each of the algorithms used in this study.

Algorithm               BA-FT   BA-YP   FT-YP
Graemlin [70]           N/A     N/A     N/A
Match-and-Split [165]   3       2       0
NetworkBLAST [204]      1       1       0
GraphHopper [191]       44      74      45

4.2.3 Conserved protein interaction modules

Encouraged by the evidence in our data suggesting that all three pathogens target proteins involved in apoptotic, inflammatory and immune response pathways, we sought to perform a more systematic comparative analysis of the three host-pathogen PPI networks. Similar analyses have yielded many insights into the evolution of PPI networks of model organisms [202]. We hypothesized that the human-pathogen networks studied here may share common sub-networks of interactions between orthologous proteins. We used the GraphHopper algorithm [191] to identify Conserved Protein Interaction Modules (CPIMs) among the three human-bacteria networks. Table 4.2 summarizes these results.

Graemlin [70] requires the user to provide the topology of expected conserved modules as positive examples. Thus, Graemlin is not directly applicable to our scenario since there are no such examples available for these systems. Using the Match-and-Split [165] algorithm we are able to identify three CPIMs for the B. anthracis-F. tularensis comparison, two CPIMs for the B. anthracis-Y. pestis comparison, and zero for the F. tularensis-Y. pestis comparison. NetworkBLAST [204] has a number of user parameters that can be adjusted, e.g., complex density and false negative (FN) rates, so we tried different combinations of values. For the complex density parameter, we vary the input value from 0.50 to 0.95 in steps of 0.05. We follow the same procedure for the FN ratio over the range 0 to 0.80. Varying these parameters had no effect on the CPIMs identified by NetworkBLAST in our case: it returned a single CPIM for each of the B. anthracis-F. tularensis and B. anthracis-Y. pestis comparisons and zero CPIMs for the F. tularensis-Y. pestis comparison. Using the GraphHopper [191] algorithm we were able to identify many more significant CPIMs. In total we identify 44 CPIMs for the B. anthracis-F. tularensis comparison, 74 for the B. anthracis-Y. pestis comparison, and 45 for the F. tularensis-Y. pestis comparison. We use the Cerebral plugin [15] for Cytoscape [201] to render all of our CPIM images. We focus on those CPIMs identified by GraphHopper in which the most enriched functions are related to pathogenesis and the host response. We discuss two CPIMs related to the host’s immune response.

The major histocompatibility complex (MHC) proteins are responsible for presenting antigens to T cells. Antigen processing and presentation is crucial for activating T cells and mounting protective immune responses. Our analysis captures CPIMs containing human proteins enriched in both antigen processing and presentation functions. For example, we find an interaction between the human HLA-B protein and the B. anthracis pagA protein. HLA-B is an MHC class I protein responsible for presenting antigen fragments to CD8+ T cells. The pathogen pagA protein, along with the lethal factor and oedema factor, is one of three proteins composing the anthrax toxin. Functionally, the pagA protein facilitates the translocation of the enzymatic toxins across the cell membrane. Also interacting with HLA-B is the Y. pestis yscP protein, which is part of the Yersinia outer-membrane protein (YOP) secretion system. Members of the YOP family have been shown to interact with MHC I proteins in the closely related pathogen, Yersinia enterocolitica [213]. Other members of the MHC class I family found to be interacting with B. anthracis, F. tularensis, and Y. pestis proteins include HLA-A, HLA-C, and HLA-E. We also identify a number of interactions for human proteins belonging to MHC class II (i.e., HLA-DRA, HLA-DPB1, HLA-DQB1, and HLA-DMB), which are responsible for presenting antigens to CD4+ T cells. These MHC class II proteins interact with various pathogen proteins including a number of pathogen membrane proteins and as yet uncharacterized proteins (see Figures 4.6(a)-4.6(c)).

[Figure 4.6 panels (a)-(f). Legend. Node (protein) shapes: human protein, uncharacterized pathogen protein, pathogen virulence factor. Node (protein) colors: human, B. anthracis, F. tularensis, Y. pestis. Group colors: MHC-II, MHC-I, inflammation, immune response, antigen binding. Edge colors: PPI, paralog, ortholog.]

Figure 4.6: Conserved modules of human-pathogen PPIs involved in (a-c) antigen binding and processing and (d-f) immune response pathways. For clarity these images show only interactions involving virulence factors and “uncharacterized” proteins.

NF-κB is a transcription factor found at the crossroads of numerous immune and inflammatory pathways and it contributes to the induction of innate immune responses. NF-κB is found downstream of the Toll family of receptors which participate in signaling in response to infection. Pathogens have evolved to disrupt this critical process and therefore evade the host response. Inhibition of the NF-κB pathway impairs both the activation and differentiation of T cells and antigen-presenting cells. In the case of Y. pestis, the inhibition of the NF-κB pathway is necessary for rapid apoptosis in infected macrophages [248]. We find several members of the Y. pestis YOP family, including yscI, yscK, and yopD, along with virulence factors such as the toxin tccC1 and the targeted effector protein kinase ypkA interacting with NF-κB. Many of the other pathogen interactors of NF-κB are labeled as “uncharacterized” proteins. We hypothesize that the role of these interactions is to bind NF-κB, rendering it non-functional. We also observe interactions between the Y. pestis proteins usg (an aspartate-semialdehyde dehydrogenase) and tar1 (a methyl-accepting chemotaxis transmembrane protein) and the human IKK-A protein. IKK-A phosphorylates inhibitors of NF-κB, leading to the degradation of the inhibitor and providing an active NF-κB protein. We suggest that the interactions of usg and tar1 with IKK-A prevent the deactivation of NF-κB inhibitors, ultimately leading to the inactivation of NF-κB. We also find an interaction between human NF-κB-IA, an NF-κB inhibitor that binds to NF-κB and traps it in the cytoplasm, and the Y. pestis protein y3760, a putative multi-drug resistance protein. Upstream of NF-κB we demonstrate alr1-TLR4 and Y1119-TLR7 interactions. TLR4 and TLR7 are receptors for lipopolysaccharide and viral single-stranded RNA, respectively. Our data also contain several B. anthracis and F. tularensis proteins of unknown function that interact with the human NF-κB protein.

In sum, these host-pathogen interactions point to a common strategy evolved by all three pathogens consisting of targeting host proteins that are critical for inducing immune responses. We find that these pathogens interact with many other human proteins involved in immune response pathways (see Figures 4.6(d)-4.6(f)). This strong trend of interaction between bacterial proteins and proteins of the immune system that are both crucial and conserved warrants further study across a larger number of pathogens.

4.3 Conclusions

In summary, we have provided the first large-scale PPI map for three bacterial pathogens and their human host. We used a high-throughput yeast two-hybrid approach in conjunction with computational analysis to highlight important subsets of B. anthracis, F. tularensis, and Y. pestis proteins that have been found to interact with human proteins. Systematic screening of human-pathogen PPIs also allows us to uncover novel interactions of relevance for understanding pathogenesis and therapy development. In line with recent trends in drug discovery favoring polypharmacology (i.e., drugs acting upon multiple rather than single targets) over single-target drugs [242], there is a renewed emphasis on developing broadly protective immunotherapeutics against infectious diseases. Accordingly, investigating host-pathogen interactions through the lens of protein networks may provide valuable insights into the rational design of novel classes of broad-spectrum vaccines and immunotherapeutics.

Chapter 5

A Review on Prediction of Intra-species PPIs

5.1 Literature Review

A wide range of computational methods have been previously suggested to predict protein-protein interactions (PPIs) within a single organism. Initial methods used genomic locations of genes, phylogenetic profiles, domain-domain profiles, and sequence homology to predict PPIs. More recent techniques have integrated a number of functional genomic data types such as gene expression and knockout phenotype and used sophisticated machine learning frameworks such as Bayesian networks, decision trees, random forests, and support vector machines to predict PPIs. Additionally, methods have been suggested for making predictions based on the structure of known PPI networks. We provide a brief overview of each of these different classes of predictors and give specific examples from published literature (see Figure 5.1).

5.1.1 Rosetta stone

The Rosetta stone method is based on the observation that some multi-domain proteins found in one organism are found as separate proteins in another organism. It has been proposed that these separate proteins are likely to be interacting. Aloy et al. [5] identified several different classes of such Rosetta interactions. An example is the enzyme imidazole glycerophosphate synthase, which is composed of two different structural domains: a histidine biosynthesis domain and a glutamine amidotransferase domain. In archaea, these domains are encoded by separate genes whose products form a complex; in eukaryotes these genes are fused into one (see Figure 5.2).

Marcotte et al., 1999 [153]

Marcotte et al. used the Rosetta stone method to identify potential interacting proteins in Escherichia coli. They made predictions using two different criteria: domains and sequence alignment. They identified domains present in every protein in the manually annotated portion of the UniProt database [11]. For a given protein p containing domains d and e, Marcotte et al. considered the domain pair (d, e) linked if they were able to identify two proteins, g and h, in some other organism where g contains only domain d and h contains only domain e. Marcotte et al. added the restriction that the proteins g and h must be non-homologous. In such cases, proteins g and h were predicted to interact. Using this approach they identified 3,531 probable PPIs. As an alternative to the domain-based approach, Marcotte et al. also presented an alignment-based method, in which two proteins g and h from one organism were predicted to interact if they aligned to a single protein in another organism without any overlap. This method produced 4,487 predictions. They showed that pairs of proteins predicted to interact tend to be annotated with more closely related functions than randomly paired proteins and that a small fraction (6.4%) had been shown to interact experimentally in the Database of Interacting Proteins [196], which at the time contained only 948 PPIs. They performed no cross-validation procedures. Examples of predicted interactions can be seen in Figure 5.3.

Figure 5.1: Subset of previously published methods for predicting intra-species protein-protein interactions.

Figure 5.2: Histidine biosynthesis and glutamine amidotransferase domains of the enzyme imidazole glycerophosphate synthase from Thermotoga maritima and from Saccharomyces cerevisiae. Dashed arrows show the sequence similarities between the different subunits and the solid arrow shows the structural similarity. Image and caption taken with permission from Aloy et al. [5].

Figure 5.3: Five examples of pairs of E. coli proteins predicted to interact by the Rosetta stone method proposed by Marcotte et al. [153]. Each protein is shown schematically with boxes representing domains. For each example, a triplet of proteins is pictured: The second and third proteins are predicted to interact because their homologs are fused in the first protein. Image and caption taken with permission from Marcotte et al. [153].

5.1.2 Gene neighborhood

In bacterial genomes, genetic material is organized into regions of related proteins, e.g., operons. Operons are sets of genes that are activated by the same transcription factor and have been shown to code for proteins that possess similar functions [222] and are likely to interact with each other [53]. Unfortunately, this approach is only directly applicable to bacterial genomes, since other genomes do not contain operons.

5.1.3 Phylogenetic profiles and co-conservation

It has been suggested in a number of studies that proteins which function together in a pathway or structural complex are likely to evolve in a correlated fashion. Such linked proteins are likely to be either conserved or eliminated together during evolution. This idea was originally applied to predicting the function of uncharacterized proteins by studying their phylogenetic profiles [176]. Given a set G of genomes, a single reference genome containing a set P of proteins is selected. Let G1 be the reference genome. Then for each protein p ∈ P in the reference genome, orthologs are identified in each of the non-reference genomes. A binary matrix is constructed where the cell (Gi, pj) receives a value of 1 if the non-reference genome Gi contains an ortholog of pj from the reference genome, and 0 otherwise. The resulting profiles for different proteins p ∈ P are then compared, and proteins with similar patterns of presence/absence across the set of input genomes are considered linked and are likely to share similar functions (see Figure 5.4).
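The profile construction and comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in [176]: the ortholog table, the genome and protein names, and the rule of linking profiles that differ in at most one position (borrowed from the caption of Figure 5.4) are all assumptions made for the example.

```python
from itertools import combinations

def build_profiles(orthologs, genomes):
    """Map each reference protein to a 0/1 tuple, one entry per non-reference genome."""
    return {
        protein: tuple(1 if genome in present else 0 for genome in genomes)
        for protein, present in orthologs.items()
    }

def hamming(a, b):
    """Number of positions at which two profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def linked_pairs(profiles, max_diff=1):
    """Protein pairs whose profiles differ in at most max_diff genomes."""
    return [
        (p, q)
        for p, q in combinations(sorted(profiles), 2)
        if hamming(profiles[p], profiles[q]) <= max_diff
    ]

# Hypothetical ortholog table: reference protein -> genomes containing an ortholog.
genomes = ["G2", "G3", "G4", "G5"]
orthologs = {
    "p1": {"G2", "G3"},
    "p2": {"G2", "G4"},
    "p3": {"G3", "G5"},
    "p6": {"G3", "G5"},
    "p7": {"G2", "G4", "G5"},
}

profiles = build_profiles(orthologs, genomes)
links = linked_pairs(profiles)
print(links)
```

In this toy input, p2 and p7 differ in a single genome and p3 and p6 have identical profiles, so only those two pairs are reported as linked.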

Figure 5.4: Overview of phylogenetic profile method. Colored boxes correspond to orthologous proteins found across an input set of genomes. A value of 1 in the matrix indicates the presence of a protein within the respective genome. The profiles indicate that proteins p2 and p7 (respectively, p3 and p6) are functionally linked. Edges denote two profiles that differ by at most one position.

Figure 5.5: Overview of Pazos and Valencia, 2001 [175]. The initial multiple sequence alignments were reduced, leaving only sequences of the same species; consequently, the trees constructed from the reduced alignments would have the same number of leaves and the same species in the leaves. From the reduced alignments, the matrices containing the average homology for every possible pair of proteins were constructed. Such matrices contained the structure of the phylogenetic tree. Finally, the similarity between the datasets of the two matrices, and implicitly the similarity between the two trees, was evaluated with a linear correlation coefficient.

Pazos and Valencia, 2001 [175]

Pazos and Valencia noticed that the phylogenetic trees of known interacting protein families tend to show a higher degree of similarity than those of non-interacting families. Considering 14 fully sequenced microbial genomes, they identified sets of orthologous proteins using Escherichia coli as a reference genome and a BLAST [6] E-value cutoff of $1 \times 10^{-5}$. In cases where multiple candidate orthologs were identified for a particular E. coli protein, they chose the protein that was most similar to the E. coli protein. This computation yielded multiple protein families, one for each E. coli gene. These protein families were then aligned with ClustalW [230] in order to compute the pairwise distances between all pairs of proteins within the family. Pazos and Valencia stored these pairwise distance values in a matrix R for each protein family. Then for each pair of protein families, they refined the initial multiple sequence alignments (MSAs) by selecting sequences that correspond to common species to produce two trees with the same number of leaves. Pazos and Valencia only considered pairs of protein families that shared at least eleven species.

Pazos and Valencia then computed the Pearson correlation coefficient between the distance matrices of two protein families to identify which pairs of protein families had similar phylogenetic trees and thus were likely interacting (see Figure 5.5). They made predictions using different correlation cutoffs; for example, using a correlation cutoff of 0.80 they predicted 2,742 PPIs. They discussed the validity of some of their predictions using a small number of known PPIs available at the time, but provided no cross-validation results.
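The correlation step can be illustrated as follows. This is a hedged sketch rather than the authors' code: it assumes the two distance matrices have already been reduced to the same species in the same order, it correlates only the upper triangles (each species pair once), and the toy matrices use four species purely for brevity even though the study required at least eleven shared species.

```python
import math

def upper_triangle(matrix):
    """Flatten the strictly upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / (sd_x * sd_y)

def tree_similarity(r_a, r_b):
    """Correlate the pairwise distances of two protein families."""
    return pearson(upper_triangle(r_a), upper_triangle(r_b))

# Hypothetical distance matrices over the same four species.
r_a = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
r_b = [[0, 2, 4, 6], [2, 0, 8, 10], [4, 8, 0, 12], [6, 10, 12, 0]]   # r_a scaled by 2
r_c = [[0, 6, 5, 4], [6, 0, 3, 2], [5, 3, 0, 1], [4, 2, 1, 0]]       # reversed ordering

CUTOFF = 0.80  # example cutoff quoted in the text
print(tree_similarity(r_a, r_b) >= CUTOFF)  # families a and b: predicted to interact
print(tree_similarity(r_a, r_c) >= CUTOFF)  # families a and c: not predicted
```

Because the Pearson coefficient is invariant to scaling, families whose trees have the same shape but different evolutionary rates still score highly.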

Karimpour-Fard et al., 2007 [121]

More recently, Karimpour-Fard et al. extended the phylogenetic profile method and the study of Clustered Co-Conserved (CCC) interactions to identify Cross-Species CCC (CS-CCC) interactions. Their method automated the comparison of CS-CCCs and identified interactions that were unique or absent across different species. They built interaction maps for six γ-proteobacteria using predictions made by the phylogenetic profiling method contained in the Prolinks database [30] with a confidence value of at least 60%. Prolinks uses a BLASTP [6] E-value threshold of less than $1 \times 10^{-10}$ to define a homolog of a query protein to be present in a secondary genome. Proteins for each organism were then mapped back to Escherichia coli, which was used as a target. CCC networks were constructed from the pairwise links identified in the Prolinks database and CS-CCC networks were constructed using a breadth-first search algorithm. They extended the target network by adding interactions for which at least one protein is already contained in the source network and an interaction involving that protein was predicted in another organism. They used this method to identify PPIs that were common to multiple organisms and those which were unique to a particular organism (see Figure 5.6). They applied this approach to construct comprehensive networks involved in motility and secretion and discussed the predicted interactions contained in these and other networks.

Figure 5.6: Cross-Species Clustered Co-Conserved method proposed by Karimpour-Fard et al. [121]. After obtaining precomputed PPI networks, all proteins are mapped to a single organism (E. coli). Next, combined, common, and unique networks are constructed and analyzed.

5.1.4 Domain-domain profiles

Physical interactions between proteins are often mediated by specific domains [174]. An example in the parasite responsible for causing malaria, Plasmodium falciparum, is the Duffy-binding-like domain that interacts with the human complement receptor 1 of uninfected Red Blood Cells (RBCs) to gain entrance into the RBC [193].

The domain-domain profile method relies on statistics to determine which pairs of domains appear more frequently within interacting pairs of proteins than expected by random chance. This is done by storing counts of observed domain pairs within known interactions and then applying a scoring criterion to define those which are over-represented. Such pairs of domains are then used to make predictions. Methods using domain-domain profiles to predict PPIs differ in the manner in which they use the resulting matrix to identify over-represented pairs of domains. For each of the methods discussed below let D(g, d) denote the event that protein g contains domain d and I(g, h) be the event that proteins g and h interact. Further, let P be the set of proteins with at least one domain and at least one interaction and Pd be the subset of proteins in P that contain domain d. Finally, let S be the set of interactions between pairs of proteins in P and let Sd,e be the subset of S where one protein contains domain d and the other contains domain e.

Sprinzak and Margalit, 2001 [211]

Sprinzak and Margalit first suggested a domain-pair based method using a log-odds transformation for each domain-pair value. For a given pair of domains, (d, e), they computed the log-odds value as:

\[ \mathrm{score}(d, e) = \log_2 \frac{|S_{d,e}|}{|P_d|\,|P_e|} \tag{5.1} \]

Favorable domain pairs were identified by their positive log-odds value. Sprinzak and Margalit used a log-odds cutoff of 2 to identify over-represented pairs of domains. They predicted PPIs by identifying which pairs of proteins contained at least one of the domain pairs with a log-odds value greater than the cutoff. Sprinzak and Margalit showed they were able to achieve a sensitivity of 94% (TP / (TP + FN)) for predictions between proteins in the Saccharomyces cerevisiae system. They did not give a value for precision (TP / (TP + FP)), reasoning that the set of known interactions was too small (2,908 PPIs at the time of publication of that work).
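As an illustration of the counting behind Equation 5.1, the sketch below computes the score from a toy set of proteins and interactions using the definitions of P, Pd, and Sd,e given in Section 5.1.4. It follows the equation exactly as written here, with raw counts; the published cutoff of 2 presumably applies to frequency-normalized values, so no cutoff is applied in the example. The protein and domain names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def domain_pair_scores(domains, interactions):
    """Score every observed domain pair as log2(|S_de| / (|P_d| * |P_e|))."""
    # P: proteins with at least one domain and at least one interaction.
    interacting = {protein for pair in interactions for protein in pair}
    P = {protein for protein in interacting if domains.get(protein)}
    # S: interactions whose two partners both belong to P.
    S = [(g, h) for g, h in interactions if g in P and h in P]
    # |P_d|: number of proteins in P containing domain d.
    p_count = Counter(d for protein in P for d in domains[protein])
    # |S_de|: number of interactions with d on one side and e on the other.
    s_count = Counter()
    for g, h in S:
        pairs = {tuple(sorted((d, e))) for d, e in product(domains[g], domains[h])}
        s_count.update(pairs)
    return {
        (d, e): math.log2(n / (p_count[d] * p_count[e]))
        for (d, e), n in s_count.items()
    }

# Hypothetical toy data.
domains = {
    "a": {"SH3"},
    "b": {"PRM"},
    "c": {"SH3"},
    "d": {"PRM", "KIN"},
    "e": {"KIN"},  # has a domain but no interaction, so it is not in P
}
interactions = [("a", "b"), ("c", "d")]

scores = domain_pair_scores(domains, interactions)
for pair in sorted(scores):
    print(pair, scores[pair])
```

Domain pairs are stored in sorted order so that (d, e) and (e, d) are counted as the same pair.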

Figure 5.7: For sequence signatures and domain-domain profiles a set of known interactions is used and for each interaction counts are made for observed domain pairs.

Kim et al., 2002 [125]

Kim et al. recognized that multiple domain pairs can predict that the same pair of proteins interact. They defined a statistical scoring system, the Potentially Interacting Domain (PID) score. For a given interaction between two proteins (g, h), let Dg (respectively, Dh) be the set of domains contained in g (respectively, h). Kim et al. defined the score for a predicted PPI as the sum of the PID scores of all pairs of domains with a positive PID score.

\[ \mathrm{score}(g, h) = \sum_{i \in D_g,\, j \in D_h} \max\left( \log_2\left( \frac{1/(|D_g|\,|D_h|)}{(|P_i|/|P|)\,(|P_j|/|P|)} \right) - \tau(i, j),\; 0 \right) \tag{5.2} \]

Thus the weighted frequency of each PID was counted as inversely proportional to the number of PIDs derived from each interacting protein pair. The τ term of Equation 5.2 was used as a correction factor; see Kim et al. [125] for a description of this correction factor. If a protein pair (g, h) had a score greater than 0, then at least one pair of domains had a PID score greater than 0. Such protein pairs were labeled as interacting. Kim et al. performed a ten-fold cross-validation using a data set of interactions obtained from the Database of Interacting Proteins [196] (10,432 PPIs for the Homo sapiens and Saccharomyces cerevisiae systems) for the set of true examples and randomly paired proteins from the UniProt database [11] as hypothetical negative interactions. They showed they were able to achieve a sensitivity of 50% (TP / (TP + FN)) and specificity of 98.4% (TN / (TN + FP)).

Deng et al., 2002 [57]

Deng et al. extended the domain-profile method further by considering all possible pairs of domains between a pair of proteins and combining their probabilities using a probabilistic “or”, assuming independence between domains. The probability that I(g, h) = 1, i.e., that proteins g and h interact, was computed as follows:

\[ \Pr(I(g, h) = 1) = 1 - \prod_{d \in D_g,\, e \in D_h} \left( 1 - \frac{|S_{d,e}|}{|P_d|\,|P_e|} \right) \tag{5.3} \]

Deng et al. considered the effects of certain error rates associated with large-scale experimental screens of PPIs used in training, and assigned probabilities to domain pairs for making predictions by using a Maximum Likelihood Estimation (MLE) approach in conjunction with an Expectation Maximization (EM) algorithm. Using different probability cutoff values as indicators for predicting PPIs they achieved sensitivity rates between 75% and 100% (TP / (TP + FN)) and specificity rates between 5% and 45% (TN / (TN + FP)) in the Saccharomyces cerevisiae system.
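The probabilistic “or” of Equation 5.3 can be sketched as below. The domain-pair probabilities are taken as given inputs here; in the work of Deng et al. they would come from the MLE/EM procedure, which this example does not attempt to reproduce. The function name, the dictionary layout, and the domain names are assumptions made for illustration.

```python
from itertools import product

def interaction_probability(domains_g, domains_h, pair_prob):
    """Probability that at least one domain pair mediates an interaction.

    pair_prob maps frozenset({d, e}) to the probability that domains d and e
    interact; unknown pairs are treated as having probability 0.
    """
    prob_no_pair_interacts = 1.0
    for d, e in product(domains_g, domains_h):
        prob_no_pair_interacts *= 1.0 - pair_prob.get(frozenset((d, e)), 0.0)
    return 1.0 - prob_no_pair_interacts

# Hypothetical domain-pair probabilities.
pair_prob = {
    frozenset(("SH3", "PRM")): 0.5,
    frozenset(("KIN", "PRM")): 0.2,
}

p = interaction_probability({"SH3", "KIN"}, {"PRM"}, pair_prob)
print(round(p, 3))
```

Each additional domain pair with non-zero probability can only raise the predicted probability, which is the intended effect of combining evidence under the independence assumption.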

Figure 5.8: Interolog mapping involves taking a set of known interactions from a group of source organisms, identifying orthologs in a target organism, and mapping interactions onto the target.

5.1.5 Protein sequence

The next class of predictive algorithms relies on the amino acid sequence of the interacting proteins. Assuming that PPIs that control critical cellular processes should be evolutionarily conserved, it is believed that the underlying sequences may provide sufficient information for predicting interactions.

Yu et al., 2004 [244]

Yu et al. introduced this concept with the idea of interolog mapping. An interolog is a pair of proteins (g, h) known to interact in one organism, where both g and h have orthologs in some other organism that are also known to interact. Interolog mapping is the transferring of interactions from one organism to another using comparative genomics (see Figure 5.8).

Yu et al. proposed two methods for mapping interactions onto a target organism. Since a source protein can have multiple orthologs in the target (i.e., paralogs within the target), the generalized mapping method predicted interactions between all pairwise combinations of orthologs from both source proteins. For example, if g had two orthologs in the target organism, y1 and y2, and h had one ortholog in the target organism, z, then Yu et al. predicted two interactions, (y1, z) and (y2, z). The second method Yu et al. presented predicted interactions only in the case where the orthologs were best reciprocal BLAST hits. They showed that this method increased sensitivity. We describe this method below.

For each interaction (g, h), Yu et al. identified orthologs of both proteins g and h in a targetorganism. A target organism protein y was considered an ortholog of the source protein g if

1. y had a BLASTP [6] E-value ≤ $10^{-10}$ with g,

2. y shared ≥ 80% of g’s residues as described in the BLASTP alignment,

3. y was the only ortholog candidate of g satisfying conditions 1 and 2, and

4. conditions 1, 2, and 3 were reciprocally true.

The score for mapping an interaction (g, h) onto a target was defined in terms of the joint similarity, j, between protein g (respectively, h) and its ortholog y (respectively, z). There are many potential ways to define joint sequence similarity, but Yu et al. stated that different definitions of j do not drastically affect the outcome. They defined j in terms of both BLAST E-values and sequence conservation. Let E(g, y) denote the BLAST E-value for a source protein g and its target ortholog y, and likewise let R(g, y) denote the fraction of shared residues. Yu et al. defined the joint similarity score $j_{E\text{-value}} = \sqrt{E(g, y) \times E(h, z)}$ for BLAST E-values and $j_{\text{identity}} = \sqrt{R(g, y) \times R(h, z)}$ for shared residues. Yu et al. stated that PPIs can be transferred when an interacting pair of proteins has $j_{E\text{-value}} \le 10^{-70}$ or $j_{\text{identity}} \ge 80\%$.
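The joint similarity scores are geometric means of the two per-protein values, and the transfer rule can be written directly from the thresholds quoted above. The following is a minimal sketch; the function names and the example E-values and identities are hypothetical, and the identity values are expressed as fractions rather than percentages.

```python
import math

def joint_similarity(value_gy, value_hz):
    """Geometric mean of the similarity of g to y and of h to z."""
    return math.sqrt(value_gy * value_hz)

def can_transfer(e_gy, e_hz, r_gy, r_hz, e_cutoff=1e-70, identity_cutoff=0.80):
    """Apply the transfer rule: joint E-value small enough OR joint identity high enough."""
    j_evalue = joint_similarity(e_gy, e_hz)
    j_identity = joint_similarity(r_gy, r_hz)
    return j_evalue <= e_cutoff or j_identity >= identity_cutoff

# Hypothetical ortholog pairs (E-values and fractions of shared residues).
print(can_transfer(1e-80, 1e-90, 0.50, 0.50))  # strong joint E-value
print(can_transfer(1e-20, 1e-30, 0.90, 0.90))  # high joint identity
print(can_transfer(1e-20, 1e-30, 0.50, 0.60))  # neither criterion met
```

Using a geometric mean means that one weak ortholog assignment pulls the joint score down sharply, so both proteins must be well conserved for an interaction to be transferred.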

Shen et al., 2007 [206]

Given an amino acid sequence s, Shen et al. represented s by the set of triads (three consecutive amino acids). They partitioned the 20 amino acids into seven groups based on electrostatic and hydrophobic properties, as shown in Figure 5.9. Thus each triad of amino acids mapped into one of 343 (7 × 7 × 7) combinations of these groups. They represented a protein g using a 343-dimensional vector fg: each dimension of fg corresponded to one of the 343 combinations of groups and each value in fg is the number of occurrences of the corresponding triad in g. Values contained in the vector fg were normalized to account for protein length. If fg,j denotes the value of dimension j in fg, Shen et al. defined the normalized vector dg as:

\[ d_{g,j} = \frac{f_{g,j} - \min_{1 \le k \le 343} f_{g,k}}{\max_{1 \le k \le 343} f_{g,k}} \tag{5.4} \]
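The triad encoding and the normalization of Equation 5.4 can be sketched as follows. The seven-group partition used here is the one commonly attributed to Shen et al. and is an assumption of this example, since the text refers to Figure 5.9 without listing the groups; it should be checked against [206]. Skipping triads that contain non-standard residues and returning an all-zero vector for sequences shorter than three residues are likewise choices made only for this sketch.

```python
from itertools import product

# Assumed seven-class partition of the 20 amino acids (verify against [206]).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: index for index, members in enumerate(GROUPS) for aa in members}

# Fixed ordering of the 7 * 7 * 7 = 343 group triads.
TRIAD_INDEX = {triad: k for k, triad in enumerate(product(range(7), repeat=3))}

def triad_counts(sequence):
    """Count occurrences of each group triad (the vector f_g)."""
    counts = [0] * len(TRIAD_INDEX)
    for i in range(len(sequence) - 2):
        window = sequence[i:i + 3]
        if all(aa in AA_TO_GROUP for aa in window):
            triad = tuple(AA_TO_GROUP[aa] for aa in window)
            counts[TRIAD_INDEX[triad]] += 1
    return counts

def normalize(counts):
    """Equation 5.4: d_j = (f_j - min f) / max f."""
    lowest, highest = min(counts), max(counts)
    if highest == 0:
        return [0.0] * len(counts)
    return [(value - lowest) / highest for value in counts]

def pair_vector(sequence_g, sequence_h):
    """Concatenate the normalized vectors of two proteins (686 dimensions)."""
    return normalize(triad_counts(sequence_g)) + normalize(triad_counts(sequence_h))

f = triad_counts("AGVAGV")          # all four triads fall in group triad (0, 0, 0)
d = normalize(f)
v = pair_vector("AGVAGV", "KRDEC")
print(f[0], d[0], len(v))
```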

Figure 5.9: Schematic diagram for constructing the vector space (V, F) of a protein sequence. V is the vector space of the sequence features; each feature (vi) represents a triad composed of three consecutive amino acids; F is the frequency vector corresponding to V, and the value of the ith dimension of F (fi) is the frequency that vi triad appeared in the protein sequence. Image and caption taken with permission from Shen et al. [206]. Copyright 2007 National Academy of Sciences, U.S.A.

The normalized vector dg,h for a pair of proteins, g and h, was simply the concatenation of dg and dh. They used a Support Vector Machine (SVM) to learn and predict PPIs. SVMs are a powerful and popular approach in machine learning for solving two-class pattern recognition problems. Given the training set S, an SVM classifier computes a plane separating the vectors in S with label 1 from the vectors with label −1, optionally after projecting the vectors to a higher-dimensional feature space. An important feature of SVMs is that the separating plane has maximum margin, where the margin is the distance from the separating plane to the closest vector. Oftentimes, the projection to a high-dimensional feature space can be succinctly encoded by a kernel function, which computes the dot product of the projections of two vectors without explicitly computing the projections themselves. Shen et al. used an S-kernel function defined as:

\[ K(d_{g,h}, d_{y,z}) = \min\left( \|d_g - d_y\|^2 + \|d_h - d_z\|^2,\; \|d_g - d_z\|^2 + \|d_h - d_y\|^2 \right) \tag{5.5} \]
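The point of taking the minimum in Equation 5.5 is that a protein pair has no natural order: (g, h) and (h, g) describe the same interaction. The short sketch below computes the quantity exactly as written in Equation 5.5 and shows that it is unchanged when either pair is swapped. The function names and toy vectors are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def s_kernel(d_g, d_h, d_y, d_z):
    """Equation 5.5: compare pair (g, h) with pair (y, z) under both pairings."""
    direct = squared_distance(d_g, d_y) + squared_distance(d_h, d_z)
    swapped = squared_distance(d_g, d_z) + squared_distance(d_h, d_y)
    return min(direct, swapped)

# Toy two-dimensional vectors standing in for the 343-dimensional d vectors.
d_g, d_h = [1.0, 0.0], [0.0, 1.0]
d_y, d_z = [0.0, 1.0], [1.0, 0.0]

print(s_kernel(d_g, d_h, d_y, d_z))  # the swapped pairing matches exactly
print(s_kernel(d_h, d_g, d_y, d_z))  # same value with (g, h) reversed
```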

After optimization and training using PPIs from the Homo sapiens system, they performed a four-fold cross-validation test and achieved rates of precision >82.23% (TP / (TP + FP)) and sensitivity >84.00% (TP / (TP + FN)).

Burger and van Nimwegen, 2008 [35]

Figure 5.10: Illustration of the model used to assign a probability P(D|a) to the joint multiple sequence alignment D of two protein families given an assignment a of interaction partners between them. Sequences from the same genomes have the same color and horizontally aligned sequences are assumed to interact. The probabilities of pairs of alignment columns (ij) depend on the number of times $n^{ij}_{\alpha\beta}$ that amino acids (αβ) occur in the corresponding columns. A dependence tree T and the corresponding factorization of the probability P(D|a, T) of the entire alignment given the assignment and dependence tree are illustrated at the bottom of the figure. Image and caption taken with permission from Burger and van Nimwegen [35].

Burger and van Nimwegen introduced a Bayesian network method that predicted PPIs using only Multiple Sequence Alignments (MSAs) of protein families and without using any training examples. Their method used pairs of MSAs for which it was known that members of one MSA could interact with members of another (see Figure 5.10). They defined an assignment a as an MSA that implied a common alignment of all sequences of both families. Let MSAb,c denote the MSA of two protein families b and c. They calculated the probability P(MSAb,c | a) of the entire joint alignment. Given a weight matrix ω, whose entry $\omega_{\alpha\beta}$ specifies the joint probability of seeing amino acid α at position i and amino acid β at position j, and letting $n^{ij}_{\alpha\beta}$ be the number of times the pair (α, β) occurs in columns i and j of the MSA, they first calculated the probability P(Mij | ω) as:

\[ P(M_{ij} \mid \omega) = \prod_{\alpha\beta} \omega_{\alpha\beta}^{\,n^{ij}_{\alpha\beta}} \tag{5.6} \]

They used a Dirichlet prior $P(\omega) \propto \prod_{\alpha\beta} \omega_{\alpha\beta}^{\lambda - 1}$ and integrated over the unknown weight matrix ω to get the following:

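\[ P(M_{ij}) = \frac{\Gamma(21^2\lambda)}{\Gamma(n + 21^2\lambda)} \prod_{\alpha\beta} \frac{\Gamma(n^{ij}_{\alpha\beta} + \lambda)}{\Gamma(\lambda)} \tag{5.7} \]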

We refer the reader to Burger and van Nimwegen [35] for a full derivation and discussion of Equation 5.7. They applied this method to predict interactions in bacterial Two-Component Systems (TCSs). TCSs are responsible for signal transduction and consist of two proteins, a histidine kinase and a response regulator [215]. Since they did not use a training set, they did not perform any cross-validation analysis.

5.1.6 Data integration

With the appearance of large genomic data sets, techniques that integrate various types of heterogeneous data became possible. One of the main advantages of using data integration is the ability to include various types of data in the prediction process. Therefore, given a gold-standard dataset, these methods are able to “learn” the characteristics of PPIs based on each data type and then integrate this information to make predictions.

Jansen et al., 2003 [113]

Jansen et al. first introduced the idea while making predictions in Saccharomyces cerevisiae and used Bayesian networks, which allow for combining highly dissimilar types of data and integrating them into a single common framework. Using a set of known interactions, joint probabilities are computed for the features given the category, Pr(Data | Interaction). Then using the computed probabilities and Bayes' rule, probabilities of the categories given the features are estimated, Pr(Interaction | Data). Instead of a cross-validation approach, Jansen et al. used a likelihood ratio score. They defined the likelihood ratio for a feature f as the fraction of known interactions containing that feature (f+ out of g+ interacting protein pairs) divided by the fraction of non-interactions having the feature (f− out of g− non-interacting protein pairs).

Figure 5.11: Combination of datasets into probabilistic interactomes. Image and caption taken with permission from Jansen et al. [113].

\[ L(f) = \frac{f^{+}/g^{+}}{f^{-}/g^{-}} \tag{5.8} \]

Jansen et al. applied their method at large scale to predict Probabilistic Interactomes (PIs). They used high-throughput datasets and a fully connected Bayesian network to account for correlated evidence between datasets. They named this network their PI-Experimental (PIE) network. Additionally, they used a naïve Bayes network to integrate additional sources of genomic data (e.g., GO biological process annotations and mRNA co-expression) and make additional predictions. They called this their PI-predicted (PIP) network. Both the PIE and PIP networks were integrated to create their PI-total (PIT) network (see Figure 5.11). They deemed a protein pair (g, h) to interact if the pair's combined likelihood ratio exceeded 600. They did not report results for the PIT network, but in the case of PIE and PIP they reported sensitivity levels of 1% and 27% (TP / (TP + FN)), respectively.
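To make the likelihood ratio of Equation 5.8 concrete, the sketch below computes it from counts and combines several ratios by multiplication, which is what a naïve Bayes network amounts to when the features are treated as conditionally independent. The counts are invented for illustration, and the function names are hypothetical; the fully connected network used for the PIE would instead estimate a ratio for each joint combination of feature values.

```python
from math import prod

def likelihood_ratio(f_pos, g_pos, f_neg, g_neg):
    """Equation 5.8: (f+ / g+) / (f- / g-)."""
    return (f_pos / g_pos) / (f_neg / g_neg)

def combined_likelihood_ratio(ratios):
    """Naive Bayes combination: multiply the ratios of independent features."""
    return prod(ratios)

def predicted_to_interact(ratios, cutoff=600.0):
    """Apply the cutoff quoted in the text to the combined ratio."""
    return combined_likelihood_ratio(ratios) > cutoff

# Hypothetical feature: present in 80 of 100 known interacting pairs
# and in 10 of 1,000 non-interacting pairs.
lr = likelihood_ratio(80, 100, 10, 1000)
print(round(lr, 6))

print(predicted_to_interact([80.0, 10.0]))  # combined ratio 800
print(predicted_to_interact([80.0, 5.0]))   # combined ratio 400
```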

Zhang et al., 2004 [247]

Zhang et al. used a Decision Tree (DT) and nine genomic attributes to predict co-complexed proteins in Saccharomyces cerevisiae. The DT classifier constructs a tree in which each leaf is labeled with one of the classes (e.g., co-complexed or not complexed) and each internal node corresponds to a single feature-value pair and a subset of protein pairs used in training. The feature chosen at each node is the one that maximizes the separation of the protein pairs associated with the node based on their classification. A common measure to evaluate the separation is the Gini index, which measures the purity of each of the child nodes. Given a set of protein pairs P associated with a particular node n, let pl (respectively, pr) be the total count of protein pairs in P that do (respectively, do not) meet the selection criteria. Further let li (respectively, ri) denote the number of those protein pairs belonging to class i. Considering an input set of classes C, i.e., co-complexed and not complexed, the Gini index of a node n in a DT is computed as follows:

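\[ \mathrm{Gini}(n) = \frac{p_l}{|P|}\left[ 1 - \sum_{i=1}^{|C|} \left( \frac{l_i}{p_l} \right)^2 \right] + \frac{p_r}{|P|}\left[ 1 - \sum_{i=1}^{|C|} \left( \frac{r_i}{p_r} \right)^2 \right] \tag{5.9} \]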

The process of splitting at internal nodes in a DT is repeated until all leaves are pure, i.e., contain only co-complexed or only not complexed examples. After the splitting process stops, DTs are typically pruned by removing branches that provide the least additional predictive power. Rather than leaves containing binary values for classification, Zhang et al. constructed a probabilistic decision tree in which each leaf node contains an associated probability. Using a four-fold cross-validation they were able to obtain a precision of about 80% (TP / (TP + FP)) with a false-positive rate of only 1% (FP / (FP + TN)).
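The Gini index of a candidate split (Equation 5.9) is the size-weighted average of the impurity of the two child nodes. A minimal sketch, with class counts per child passed as lists and the example counts chosen arbitrarily:

```python
def gini_impurity(class_counts):
    """1 - sum of squared class fractions; 0 for a pure (or empty) node."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def gini_split(left_counts, right_counts):
    """Weighted Gini index of a split into a left and a right child.

    left_counts[i] and right_counts[i] play the role of l_i and r_i;
    at least one child is assumed to be non-empty.
    """
    p_l, p_r = sum(left_counts), sum(right_counts)
    total = p_l + p_r
    return (p_l / total) * gini_impurity(left_counts) + (p_r / total) * gini_impurity(right_counts)

# Class order: [co-complexed, not complexed].
print(gini_split([10, 0], [0, 10]))  # perfect separation
print(gini_split([5, 5], [5, 5]))    # split carries no information
print(gini_split([8, 2], [1, 9]))    # intermediate case
```

The feature and threshold giving the lowest value are the ones that best separate the classes at that node.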

Qi et al., 2006 [186]

Qi et al. provided a head-to-head comparison between a number of machine learning techniques. In total, they compared six machine learning algorithms: SVM, Bayesian networks, and decision trees, which have already been discussed, along with logistic regression, random forest, and random forest k-nearest neighbor. Logistic regression is a generalized statistical model that can predict a discrete outcome from a set of variables of different types. The random forest method is an extension of a decision tree in which many decision trees are built simultaneously, with each DT using a random subset of the features. Given an unclassified pair of proteins, each tree decides whether the pair is interacting or not. The final status is assigned based on a majority rule among all the trees. Finally, the random forest k-nearest neighbor classifier is an extension of the random forest algorithm. Taking the resulting list of votes for a particular pair of proteins, the pair is mapped into n-dimensional space (n being the number of random trees constructed) and assigned a classification based on the majority rule of its k nearest neighbors.

Qi et al. made the distinction between the different types of PPIs (e.g., direct, co-complex, and co-pathway) and analyzed each case separately using seventeen types of genomic data and interactions from numerous organisms. They reported that in all cases the random forest and k-nearest neighbor methods performed the best (precision >80% (TP / (TP + FP)) in all cases and recall between 20% and 60% (TP / (TP + FN))), with the SVM approach coming in a close third. Using the random forest classifier, they addressed the question of which data type is the best for predicting PPIs. This was done by comparing the Gini indices for the data types, which signify each data type's ability to partition the training set into pure classification groups. For all three types of PPIs, gene expression data were the most useful in classifying true interactions, followed by Gene Ontology annotations [10].

Huang et al., 2007 [104]

Rather than using machine learning, Huang et al. defined a scoring system that integrates five types of evidence to predict PPIs: interologs (I), domain-domain profiles (D), tissue specificity (T), sub-cellular localization (L), and cell-cycle stage (P).

To define interologs, Huang et al. used known interactions in several model organisms to predict interactions between Homo sapiens proteins. Rather than using raw E-values from BLAST [6], they used the score provided by the INPARANOID algorithm [190]. We refer the reader to Remm et al. [190] for a full description of the INPARANOID algorithm. Considering a pair of hypothetical interacting proteins (g, h), let ig (respectively, ih) denote the INPARANOID score of g (respectively, h) when compared to some protein y (respectively, z) from another organism, where y and z are known to interact. Huang et al. defined an evolutionary conservation parameter, ω, where organisms closely related to the target organism receive values closer to one and more distant organisms receive smaller values. The interolog score si of a pair of predicted interactors, (g, h), was then computed as:

\[ s_i = \omega \times \min(i_g, i_h) \tag{5.10} \]

Huang et al. computed domain-domain profiles in a manner similar to methods presented earlier in Section 5.1.4. The score sd of a domain pair (d, e) was computed as the fraction of interactions in which the domain pair occurred over the total possible number of interactions in which the pair could have occurred. The tissue specificity score st was calculated by counting the tissues in which both proteins, (g, h), showed at least a two-fold up-regulation across a number of gene expression data sets. The sub-cellular localization score sl was calculated as the deepest level of common Gene Ontology (GO) term [10] within the “Cellular Component” namespace. Finally, the cell-cycle stage score sc was computed as the number of overlapping cell-cycle phases between the two proteins, (g, h). Letting τ(si) denote the average score for evidence i among all the considered evidences E, they defined a single confidence score, cs, for a pair of proteins, (g, h):

\[ cs(g, h) = \sum_{i \in E} \frac{s_i(g, h)}{\tau(s_i(g, h))} \tag{5.11} \]

They made predictions between H. sapiens proteins using known interactions from a number of other organisms: R. norvegicus, M. musculus, D. melanogaster, A. thaliana, and S. cerevisiae. They did not report rates of sensitivity or precision.

5.1.7 Graph structure

The last class of predictive algorithms uses the structure of known PPI networks to infer potential missing interactions.

Yu et al., 2006 [245]

Yu et al. suggested a method of predicting PPIs based on completing defective cliques within the known PPI network of Saccharomyces cerevisiae. A clique in a graph is a subset of nodes and edges such that every pair of nodes in the subset is connected by an edge. In biology, cliques often correspond to protein complexes. Examining known PPIs and identifying defective cliques (i.e., those missing only a few interactions to be a fully connected component) provides an effective method for identifying probable missing interactions.

Let k_pq be a defective clique that contains the nodes p and q but does not contain the edge (p, q). If V is the node set of k_pq, it is easy to see that k_pq is the union of two cliques on the node sets V − {p} and V − {q}. This observation motivated the following algorithm for predicting PPIs:

1. Find all cliques in the network,

2. Find pairs of cliques overlapping on all but one node each,

3. In each of these pairs, predict the edges between the non-overlapping nodes, and

4. Add the new edges to the network.

Figure 5.12: A defective clique k of size n that contains one missing edge (p, q) is the same as the union of two complete cliques of size n-1, p and q.
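The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of Yu et al.: it assumes that "all cliques" means maximal cliques (enumerated here with the basic Bron-Kerbosch recursion), and the function names and the `min_overlap` parameter are hypothetical.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def predict_missing_edges(edges, min_overlap=1):
    """Predict edges that would complete defective cliques.

    edges: iterable of (u, v) pairs of an undirected graph without self loops.
    min_overlap: assumed minimum number of shared nodes between two cliques.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    predicted = set()
    # Step 2: pairs of cliques that overlap on all but one node each.
    for a, b in combinations(maximal_cliques(adj), 2):
        only_a, only_b = a - b, b - a
        if len(only_a) == 1 and len(only_b) == 1 and len(a & b) >= min_overlap:
            (p,), (q,) = only_a, only_b
            # Step 3: predict the edge between the non-overlapping nodes.
            predicted.add(frozenset((p, q)))
    return predicted


# A 4-clique on A, B, C, D with the edge (A, D) missing.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")]
print(predict_missing_edges(edges))
```

Step 4 (adding the predicted edges back to the network) is left to the caller, so that predictions can be inspected or filtered first.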

To address scalability issues with real PPI networks, Yu et al. added restrictions: a pair of cliques must overlap by at least k nodes and have at most l non-overlapping nodes. They found that values of k = 4 and l = 3 performed best. They made predictions using large-scale data sets of PPIs in Saccharomyces cerevisiae and verified the predictions against a manually curated database of yeast interactions [158]. They showed a number of examples where they were able to make predictions for defective cliques that are supported by known interactions in the literature (see Figure 5.13). Unfortunately, the method is restricted to predicting PPIs that are part of protein complexes.

Figure 5.13: Example of completing defective cliques in high-throughput experimental datasets. RRP43, RRP4 and RRP42 are three proteins in the yeast exosome involved in RNA processing. Large-scale datasets have these three proteins as disconnected, yielding three separate cliques from this one complex. Image and caption taken with permission from Yu et al. [245].

5.2 Application to Host-Pathogen Systems

While each of the previously mentioned methods has been applied to predicting PPIs within a single organism, little work has been done to develop such methods for predicting host-pathogen PPIs. Computational prediction of host-pathogen interactions is an important unsolved problem. Rosetta stone and gene neighborhood methods do not have any biological application to host-pathogen systems. There is no evidence to suggest that because a host-pathogen pair of proteins contains two different domains found to be in a single protein of another organism, they are likely to interact. It is possible that phylogenetic profile methods could be used to predict PPIs if there were more datasets available for closely related host-pathogen systems. This follows from the idea that pathogens often co-evolve with their hosts. Methods such as domain-profiles could be adapted to predict host-pathogen PPIs if suitable training datasets could be identified (see Chapter 6 for a discussion of such an approach). An interolog-like method has been described for predicting host-pathogen PPIs. Recently, Davis et al. [55] described a method that uses interolog mapping from a large number of bacterial PINs to predict PPIs in a large number of human-pathogen systems. In addition to interolog mapping, Davis et al. also used the 3D structures of known interacting residues to predict physical interactions.

We have discussed a number of machine-learning and data-integration based methods for predicting intra-species PPIs. These methods are difficult to adapt to host-pathogen systems for two reasons. First, experimental studies test a small number of such PPIs at a time, making it difficult to build the large true-positive datasets required for training. Second, a number of data types used to train the previously-mentioned methods, such as gene expression and knockout phenotypes, are not available for host-pathogen systems. For example, gene expression datasets are available for either the host upon infection or the pathogen during some stage within its host, but simultaneous gene expression measurements of both host and pathogen upon infection are very rarely available. In the rare cases where a large number of host-pathogen PPIs are available, e.g., the human-HIV system, developing supervised algorithms for predicting host-pathogen PPIs is feasible. However, the key is to define a feature space for which sufficient data can be obtained. We discuss one such approach in Chapter 7.

The last class of algorithms we discussed were those that make predictions based on the structures of known intra-species PINs. Again, since data is so scarce for host-pathogen systems, it would be difficult to apply such methods. However, if such data were available, these methods could be adapted to find incomplete bipartite cliques and thereby predict PPIs. Such cases represent a complex of host proteins known to interact with a complex of pathogen proteins.

In light of the severe lack of PPI data available for host-pathogen systems, it becomes apparent that developing computational methods will require previous methods to be adapted and new innovative approaches to be developed. These predictive models can only grow in complexity as more data becomes available.

Chapter 6

Predicting Host-Pathogen PPIs Using Domain Profiles

A wide range of computational methods have been developed to predict PPIs within a single organism. Initial methods used sequence-signature pairs [211], protein domain profiles [125, 167], and sequence homology [244] to predict PPIs. More recent techniques have integrated a number of functional genomic data types such as gene expression and knockout phenotype and used sophisticated machine learning frameworks such as Bayesian networks [113], decision trees [247], random forests, and support vector machines [186] to predict PPIs.

Computational prediction of host-pathogen interactions is an important unsolved problem, which is made difficult by two factors. First, experimental studies test a small number of such PPIs at a time. Only recently have efforts started to collate known host-pathogen PPIs into a comprehensive publicly-available database [117]. Second, a number of data types used to train the previously-mentioned methods, such as gene expression and knockout phenotypes, are not available for host-pathogen systems. For example, simultaneous gene expression measurements of both host and pathogen upon infection are very rarely available.

Here we introduce a novel framework for predicting and studying host-pathogen PPI networks. We use intra-species PPIs and protein-domain profiles to compute statistics on how often proteins containing specific pairs of domains interact. We use these statistics to predict inter-species PPIs in host-pathogen systems. Since gold-standard datasets of experimentally-verified host-pathogen PPIs are not readily available, we develop three computational tests to assess the validity of our predictions:

1. We identify pairs of host proteins that we predict to interact with the same pathogen protein and measure the distance between the host proteins in the host PPI network. We compute the distribution of these distances over all predicted interactions. We compute a similar distribution of distances in the pathogen PPI network between pairs of pathogen proteins we predict to interact with the same host protein.

2. We select DNA microarray datasets measuring gene expression during various stages in the parasite’s life cycle and in host cells infected by the parasite. For pairs of host proteins defined as in the distance analysis, we compute distributions of the Spearman’s correlation between the expression profiles of the proteins in a pair. We compute similar distributions for pairs of pathogen proteins.

3. We compute pairs of Gene Ontology (GO) [10] functions (one function annotating host proteins and the other annotating pathogen proteins) that are enriched in our predicted network.

We predict a total of 516 interactions with a probability of at least 0.50 in the H. sapiens-P. falciparum system (henceforth referred to as human-Plasmodium). We show that human protein pairs we predict to interact with the same Plasmodium protein are close to each other in the human PPI network, indicating that they are likely to be involved in similar biological processes. Additionally, Plasmodium pairs predicted to interact with the same human protein are coexpressed in DNA microarray datasets measured during various stages of the Plasmodium life cycle. Finally, we identify functionally enriched subnetworks in our predicted network and discuss their biological significance. For example, we identify a subnetwork connecting human proteins involved in blood coagulation to Plasmodium proteins that are “integral to membrane”. This subnetwork contains malaria proteins known to be involved in pathogenesis. Additionally, our analysis finds enriched subnetworks that cover ten of the fifteen GO functions listed by Ockenhouse et al. [168] that were enriched in genes up-regulated in individuals infected with malaria. These results demonstrate that we indeed identify plausible PPIs between human and P. falciparum proteins.

6.1 Methods

We start with a set of intra-species PPIs and the domains present in each of the interacting proteins. For every pair of functional domains a and b, we use Bayesian statistics to assess the probability that two proteins, one containing domain a and the other containing domain b, will interact. We use these domain-pair statistics to predict interactions between host proteins and pathogen proteins and to combine predictions for a single host–pathogen protein pair stemming from distinct domain pairs. After introducing our model for making predictions, we introduce three computational tests we developed for evaluating the predictions.


6.1.1 Predicting host-pathogen protein-protein interactions using domain-profiles

We first introduce some notation. Let D(g, d) denote the event that protein g contains domain d and I(g, h) be the event that proteins g and h interact. We use Pr{g, h|d, e} to denote the probability that proteins g and h interact given that g contains domain d and h contains domain e, and Pr{d, e|g, h} to denote the probability that protein g contains domain d and protein h contains domain e given that g and h interact. We use Bayes’ rule to compute Pr{g, h|d, e}.

\[ \Pr\{g, h \mid d, e\} = \frac{\Pr\{d, e \mid g, h\} \, \Pr\{I(g, h)\}}{\Pr\{D(g, d), D(h, e)\}} \tag{6.1} \]

Let P be the set of proteins with at least one domain and at least one interaction and let P_d be the subset of proteins in P that contain domain d. Let S be the set of interactions between pairs of proteins in P and let S_{d,e} be the subset of S where one protein contains d and the other contains e.

For every pair of domains d and e (d and e may be identical), we estimate each of the probabilities on the right hand side of equation (6.1) from data. Pr{d, e|g, h} is the fraction of interactions where one protein contains domain d and the other contains domain e:

\[ \Pr\{d, e \mid g, h\} = \frac{|S_{d,e}|}{|S|} \tag{6.2} \]

Pr{I(g, h)} represents the probability that a pair of proteins interact, which can be computed as the number of known interactions divided by the total number of possible interactions:

\[ \Pr\{I(g, h)\} = \frac{|S|}{\binom{|P|}{2}} \tag{6.3} \]

Here we use $\binom{|P|}{2}$, rather than $|P|^2$, to avoid counting self-interacting proteins. Finally, Pr{D(g, d), D(h, e)} is the probability that if we choose two proteins at random, one will contain domain d and the other domain e. We can estimate this probability as follows, with a correction to account for the situation when the same protein contains both domains:

\[ \Pr\{D(g, d), D(h, e)\} = \frac{|P_d||P_e| - |P_d \cap P_e|}{\binom{|P|}{2}} \tag{6.4} \]

Substituting each of these estimates back into (6.1) we get the following:

\[ \Pr\{g, h \mid d, e\} = \frac{|S_{d,e}|}{|P_d||P_e| - |P_d \cap P_e|} \tag{6.5} \]
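For readers who want the intermediate step, the substitution of the estimates (6.2), (6.3), and (6.4) into (6.1) can be written out explicitly. The factors |S| and the binomial coefficient cancel, which is why neither appears in the final estimate:

```latex
\[
\Pr\{g, h \mid d, e\}
  = \frac{\dfrac{|S_{d,e}|}{|S|} \cdot \dfrac{|S|}{\binom{|P|}{2}}}
         {\dfrac{|P_d||P_e| - |P_d \cap P_e|}{\binom{|P|}{2}}}
  = \frac{|S_{d,e}|}{|P_d||P_e| - |P_d \cap P_e|}
\]
```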

Multiple pairs of domains may predict that the same pair of proteins interact. We integrate all these predictions, assuming that they are independent. Denoting by M_g the set of domains contained in protein g, we have

\[ \Pr\{I(g, h)\} = 1 - \prod_{d \in M_g} \prod_{e \in M_h} \left(1 - \Pr\{g, h \mid d, e\}\right) \tag{6.6} \]

We do not correct for situations where multiple domains occur in a correlated manner in interacting proteins. Our analysis indicates that statistics on co-occurrence of more than two domains in interacting proteins are currently too sparse to be useful (data not shown).

To apply these ideas to a host–pathogen system, we use InterProScan [187] to identify domains in each host and pathogen protein. For every pair of host and pathogen proteins that contain at least one domain, we use equation (6.6) to estimate the probability that the proteins interact. We discard all predictions where Pr{I(g, h)} < 0.5. Let G be the resulting weighted bipartite graph of predicted interactions.
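As a rough illustration of how equations (6.5) and (6.6) fit together, the following Python sketch estimates domain-pair probabilities from a toy set of intra-species PPIs and then scores host–pathogen protein pairs. The data structures (dictionaries mapping protein names to sets of domain names) and all function names are assumptions made for this example; they are not the implementation used in this work.

```python
from itertools import product


def domain_pair_probabilities(domains, interactions):
    """Estimate Pr{g,h|d,e} = |S_de| / (|P_d||P_e| - |P_d n P_e|), as in (6.5).

    domains: dict mapping protein -> set of domains (hypothetical input format).
    interactions: iterable of (protein, protein) intra-species PPIs.
    """
    interacting = {p for pair in interactions for p in pair}
    # P: proteins with at least one domain and at least one interaction.
    proteins = {p for p in interacting if domains.get(p)}
    # S: distinct interactions between proteins in P, without self interactions.
    pairs = {frozenset((g, h)) for g, h in interactions
             if g != h and g in proteins and h in proteins}

    with_domain = {}  # P_d for every domain d
    for p in proteins:
        for d in domains[p]:
            with_domain.setdefault(d, set()).add(p)

    counts = {}  # |S_de| keyed by the unordered domain pair
    for pair in pairs:
        g, h = tuple(pair)
        keys = {frozenset((d, e)) for d, e in product(domains[g], domains[h])}
        for key in keys:
            counts[key] = counts.get(key, 0) + 1

    probs = {}
    for key, count in counts.items():
        d, e = (tuple(key) * 2)[:2]  # also handles the case d == e
        p_d, p_e = with_domain[d], with_domain[e]
        denom = len(p_d) * len(p_e) - len(p_d & p_e)
        if denom > 0:
            probs[key] = count / denom
    return probs


def interaction_probability(host_domains, pathogen_domains, probs):
    """Combine domain-pair evidence as in (6.6), assuming independence."""
    not_interacting = 1.0
    for d, e in product(host_domains, pathogen_domains):
        not_interacting *= 1.0 - probs.get(frozenset((d, e)), 0.0)
    return 1.0 - not_interacting


def predict(host, pathogen, probs, threshold=0.5):
    """Return the weighted bipartite graph G as a dict of (host, pathogen) -> Pr."""
    graph = {}
    for h, h_domains in host.items():
        for p, p_domains in pathogen.items():
            pr = interaction_probability(h_domains, p_domains, probs)
            if pr >= threshold:
                graph[(h, p)] = pr
    return graph


# Toy training data: two of the four possible A-B protein pairs interact.
training_domains = {"g1": {"A"}, "g2": {"B"}, "g3": {"A"}, "g4": {"B"}}
training_ppis = [("g1", "g2"), ("g3", "g4")]
probs = domain_pair_probabilities(training_domains, training_ppis)
print(predict({"H1": {"A"}}, {"X1": {"B"}, "X2": {"C"}}, probs))
```

Keying the statistics by unordered domain pairs reflects the definition of S_{d,e}, in which either protein may carry either domain.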

6.1.2 Prediction proximity

Since a protein’s function is governed by the other proteins it interacts with and by its indirect neighbors, we ask if two host proteins that we predict to interact with the same pathogen protein are close to each other in the host PPI network. Specifically, for each triplet (g, h, p) in G where we predict that the host proteins g and h interact with pathogen protein p, we compute the length of the shortest path between g and h in the host PPI network. We plot distributions of these triplet distances. We expect a negative correlation between the number of such pairs at a particular distance and the distance itself. This result would be significant because the closer g and h are in the host PPI network, the more likely they are to share a similar function. We plot similar distributions for all triplets (h, p, q) in G where we predict that the host protein h interacts with pathogen proteins p and q.
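The triplet distance computation can be sketched with a breadth-first search. The sketch below is illustrative only: the input formats and function names are hypothetical, and it assumes that triplets whose two host proteins are not connected in the host PPI network are simply left out of the distribution.

```python
from collections import Counter, deque
from itertools import combinations


def shortest_path_length(adj, source, target):
    """Breadth-first search; returns None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def triplet_distance_distribution(predicted, host_adj):
    """Fraction of (g, h, p) triplets at each host-network distance.

    predicted: iterable of (host protein, pathogen protein) predicted PPIs.
    host_adj: dict mapping host protein -> set of neighbors in the host PPI network.
    """
    hosts_of = {}
    for host, pathogen in predicted:
        hosts_of.setdefault(pathogen, set()).add(host)

    counts = Counter()
    for hosts in hosts_of.values():
        for g, h in combinations(sorted(hosts), 2):
            dist = shortest_path_length(host_adj, g, h)
            if dist is not None:  # assumption: skip disconnected pairs
                counts[dist] += 1

    total = sum(counts.values())
    return {dist: n / total for dist, n in counts.items()} if total else {}


host_adj = {"h1": {"h2"}, "h2": {"h1", "h3"}, "h3": {"h2"}}
predicted = [("h1", "p1"), ("h2", "p1"), ("h3", "p1")]
print(triplet_distance_distribution(predicted, host_adj))
```

The same function gives the (h, p, q) distribution if the roles of host and pathogen are swapped in the input pairs and the pathogen PPI network is passed as the adjacency structure.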

6.1.3 Prediction correlated gene expression

A number of papers have demonstrated that interacting proteins in the same organism tend to have correlated gene expression patterns [84, 112]. We reason that proteins we predict to interact should show similar behavior. However, available gene expression data sets measure expression in either the host or the pathogen but not in both simultaneously. Therefore, we consider triplets (h, p, q) in G where we predict that the host protein h interacts with pathogen proteins p and q. We ask if the gene expression profiles of p and q are correlated. We plot the distribution of the Spearman’s correlation coefficient of the expression profiles of p and q. We plot similar distributions for all triplets (g, h, p), where we predict that host proteins g and h interact with pathogen protein p.
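A minimal sketch of the triplet coexpression computation follows. It implements Spearman’s coefficient directly (Pearson correlation of average ranks) so that the example has no external dependencies; the input formats and function names are hypothetical, and triplets for which either protein lacks an expression profile are skipped.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's correlation; returns None if either profile is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / math.sqrt(vx * vy)


def triplet_correlations(predicted, expression):
    """Correlation of p and q for every (h, p, q) triplet with measured profiles.

    predicted: iterable of (host protein, pathogen protein) predicted PPIs.
    expression: dict mapping pathogen protein -> list of expression values.
    """
    pathogens_of = {}
    for host, pathogen in predicted:
        pathogens_of.setdefault(host, set()).add(pathogen)

    result = {}
    for host, pathogens in pathogens_of.items():
        measured = sorted(p for p in pathogens if p in expression)
        for p, q in combinations(measured, 2):
            rho = spearman(expression[p], expression[q])
            if rho is not None:
                result[(host, p, q)] = rho
    return result


expression = {"p1": [1, 2, 3, 4], "p2": [10, 20, 30, 40], "p3": [4, 3, 2, 1]}
predicted = [("h1", "p1"), ("h1", "p2"), ("h1", "p3")]
print(triplet_correlations(predicted, expression))
```

Binning the returned coefficients and normalizing by the number of triplets gives the kind of distribution described above.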

6.1.4 Weighted functional enrichment

We can further assess the quality of our predictions by measuring the functional coherence of the predicted network. We find pairs of functions such that proteins annotated with the functions in the Gene Ontology (GO) [10] have a surprisingly large number of predicted interactions. The hypergeometric distribution is often used to identify which biological attributes are enriched in a subset of genes of interest. However, when applied to our context, this distribution cannot take into account the probability we associate with each predicted interaction. Therefore, we apply the procedure described below.

Given a pair of GO functions c and d, let G_{c,d} be the subgraph of G induced by the host proteins annotated with c and pathogen proteins annotated with d. We define the weight w_{c,d} of G_{c,d} as the sum of the probabilities of the interactions in G_{c,d}. We assess the statistical significance of w_{c,d} as follows. Let k (respectively, l) be the number of host (respectively, pathogen) proteins in G annotated with the function c (respectively, d). We ask the following question: over all possible ways of selecting k host proteins and l pathogen proteins, what fraction of choices will induce a subgraph of G whose weight is at least w_{c,d}? We set this fraction to be the p-value p_{c,d} of the pair of functions (c, d).

To assess p_{c,d}, we generate multiple random sets of functional annotations for all host and pathogen proteins. For each random set of annotations, we compute the weight of the subgraph of G induced by the host proteins randomly annotated with c and the pathogen proteins randomly annotated with d. Since functions in GO are specified at multiple levels of detail, annotations obey the true path rule, i.e., a gene annotated with a function c is also annotated with all ancestors of c. Therefore, we ensure that each random set of annotations also satisfies the true path rule. We first apply the true path rule to the annotations. To generate a random set of annotations for host (respectively, pathogen) proteins, we randomly select a pair of host (respectively, pathogen) proteins and swap the sets of functions annotating them. We repeat this process for multiple pairs of proteins. This procedure is a modification of well-known methods for graph randomization applied to bipartite graphs (e.g., see [204]).
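The randomization procedure can be sketched as follows. Because whole annotation sets are swapped between proteins, annotations that already satisfy the true path rule continue to satisfy it after shuffling. The number of swaps per random set, the number of trials, the small floating-point tolerance, and all names in this sketch are assumptions made for illustration; they are not values taken from this work.

```python
import random


def subgraph_weight(edges, host_annot, pathogen_annot, c, d):
    """Sum of probabilities of predicted PPIs between hosts with c and pathogens with d."""
    return sum(w for (h, p), w in edges.items()
               if c in host_annot.get(h, ()) and d in pathogen_annot.get(p, ()))


def shuffle_annotations(annot, rng, swaps):
    """Swap the annotation sets of randomly chosen pairs of proteins."""
    proteins = list(annot)
    shuffled = dict(annot)
    if len(proteins) < 2:
        return shuffled
    for _ in range(swaps):
        a, b = rng.sample(proteins, 2)
        shuffled[a], shuffled[b] = shuffled[b], shuffled[a]
    return shuffled


def empirical_p_value(edges, host_annot, pathogen_annot, c, d, trials=1000, seed=0):
    """Fraction of random annotation sets whose induced weight is at least w_cd.

    edges: dict mapping (host, pathogen) -> predicted probability (the graph G).
    host_annot, pathogen_annot: dict mapping protein -> set of GO functions,
        with an entry (possibly empty) for every protein in G.
    """
    rng = random.Random(seed)
    observed = subgraph_weight(edges, host_annot, pathogen_annot, c, d)
    hits = 0
    for _ in range(trials):
        ha = shuffle_annotations(host_annot, rng, swaps=10 * len(host_annot))
        pa = shuffle_annotations(pathogen_annot, rng, swaps=10 * len(pathogen_annot))
        # Small tolerance guards against floating-point differences in the sums.
        if subgraph_weight(edges, ha, pa, c, d) >= observed - 1e-12:
            hits += 1
    return hits / trials


edges = {("h1", "p1"): 0.9, ("h2", "p2"): 0.6, ("h3", "p2"): 0.5}
host_annot = {"h1": {"c"}, "h2": set(), "h3": set()}
pathogen_annot = {"p1": {"d"}, "p2": set()}
print(empirical_p_value(edges, host_annot, pathogen_annot, "c", "d", trials=200))
```

Swapping annotation sets keeps the number of proteins annotated with each function fixed, which matches the question of selecting k host proteins and l pathogen proteins at random.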

We apply these steps to every pair of functions in GO and retain only those pairs (c, d) for which p_{c,d} ≤ 0.05. Since functions in GO are specified at multiple levels of detail, the set of enriched function pairs may contain closely related pairs of functions. We use the following criteria to collapse the enriched pairs to the most specific and most enriched function pairs. From the set of all enriched pairs, we remove a pair of functions (f, g) if there is another pair of enriched functions (l, m) such that

1. p_{l,m} < p_{f,g}, i.e., (l, m) is more statistically significant than (f, g),

2. l is either an ancestor or a descendant of f, and

3. m is either an ancestor or a descendant of g.

See Figure 6.1 for an example.

Figure 6.1: Two enriched pairs of functions (l, m) and (f, g) where l is an ancestor of f and m is an ancestor of g. Dashed lines denote paths between functions in GO defined by parent-child relationships between them.
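The three collapsing criteria translate directly into a filter over the enriched pairs. In this sketch, `ancestors` is a hypothetical dictionary mapping each GO function to the set of all of its ancestors, and the function name is likewise an assumption.

```python
def collapse_enriched_pairs(p_values, ancestors):
    """Remove (f, g) if a more significant, hierarchically related pair (l, m) exists.

    p_values: dict mapping (host function, pathogen function) -> p-value.
    ancestors: dict mapping a GO function -> set of all its ancestors.
    """
    def related(a, b):
        # True if a is an ancestor or a descendant of b.
        return a in ancestors.get(b, set()) or b in ancestors.get(a, set())

    kept = {}
    for (f, g), p_fg in p_values.items():
        dominated = any(
            p_lm < p_fg and related(l, f) and related(m, g)
            for (l, m), p_lm in p_values.items()
        )
        if not dominated:
            kept[(f, g)] = p_fg
    return kept


ancestors = {"f": {"l"}, "g": {"m"}}
p_values = {("l", "m"): 0.001, ("f", "g"): 0.01, ("x", "y"): 0.03}
print(collapse_enriched_pairs(p_values, ancestors))
```

In the toy input, (f, g) is removed because (l, m) is more significant and both of its functions are ancestors of the corresponding functions in (f, g); the unrelated pair (x, y) is retained.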

6.1.5 Datasets used

We downloaded all data used in this study in July 2006, except for the gene expression data, which we obtained in December 2006.

Genomic information

We used the Uniprot database [11] as a source for protein sequence information. We used InterProScan [187] as our method for determining protein domain profiles. We obtained functional annotations from GO [10]. We use these annotations to construct training and testing sets of PPIs relevant to pathogenesis. We also explore the use of other protein features such as signal peptides and transmembrane regions. We deemed a protein to be secreted if it is so identified by either the SignalP [22] or the SecretomeP [21] algorithms. To identify membrane-bound proteins, we used the TMHMM system [129].


Protein-protein interaction data

We gathered human, fly, and Plasmodium PPIs from five databases: the Biomolecular Interaction Network Database [80], the Database of Interacting Proteins [196], IntAct [96], the Munich Information Center for Protein Sequences [86], and Reactome [117]. After removing duplicate interactions and self interactions, we obtain a total of 39,207 human, 18,412 fly, and 2,643 Plasmodium interactions.

Gene expression profiles

We considered a number of gene expression datasets for triplet coexpression analysis. These datasets are available from NCBI’s GEO [65] and from previously published studies.

All the Plasmodium expression datasets focus on merozoite invasion of human red blood cells (RBCs). Bozdech et al. [31] and Le Roch et al. [134] measure time courses of gene expression during the intra-erythrocytic life cycle within the RBC. These two datasets contain 46 samples and 17 samples, respectively. The dataset published by Le Roch et al. [134] contains two time courses, each with seven samples, where cells are synchronized under different conditions. The last three samples measure expression within gametocytes and sporozoites, which are important during the mosquito and initial human infection stages of the Plasmodium life cycle. We did not consider datasets that contained very small numbers of samples [20, 217].

The human datasets measure gene expression by isolating peripheral blood mononuclear cells (PBMCs) from individuals. The unpublished Boldt et al. dataset (GEO series GSE1124) contains 15 samples from Gabonese children who are either healthy or show uncomplicated or severe symptoms of malaria. The Ockenhouse et al. [168] dataset measures 71 expression profiles from people who are either experimentally or naturally infected with malaria.


We apply our method to the human-Plasmodium host–pathogen system. As a negative control, we also predict PPIs for the hypothetical fly-Plasmodium host–pathogen system. In order to focus our predictions on Plasmodium proteins likely to be involved in pathogenesis, we generate our training set of proteins and PPIs as follows.

1. We remove Plasmodium proteins annotated with mitochondria, nucleus, ribosome, cell process, helicase activity, proteasome complex, nuclease activity, nucleic acid binding, or nucleotide binding. We also remove human and fly proteins annotated with ribosome, nucleic acid binding, nucleotide binding, nucleoside binding or proteolysis.

2. We add Plasmodium proteins annotated with hemoglobin metabolism, dense granule, subtilisin activity, protein folding, polymerization, cell-cell communication, or cell death, as well as human and fly proteins annotated with blood coagulation, cell-cell communication, protein folding, polymerization or cell death. We add these proteins even if they were removed in the previous step.

3. We remove proteins that do not participate in any PPIs.

4. For every domain, we count the number of proteins in which the domain occurs. We consider a domain to be uncommon if it occurs in fewer than four proteins. Given how the predictions are made, any domain pair that appears only a few times, and only in interacting proteins, causes all protein pairs containing those domains to be predicted with a probability of 1. Setting the domain cutoff at 4 removes all predictions with a probability of 1 (see Figure 6.2). We remove all proteins that contain only uncommon domains. We also ignore the presence of an uncommon domain in the remaining proteins.

Figure 6.2: Distributions of predictions made using different cutoffs for classifying uncommon domains. Numbers in parentheses represent the number of predictions made using that cutoff. [Plot: x-axis, Probability of Prediction (0.50 to 0.95); y-axis, Fraction of Predictions; one series per cutoff: 1 (17,064), 2 (3,699), 3 (1,171), 4 (516), 5 (277).]
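Step 4 of the list above, the uncommon-domain filter, can be sketched as follows. The dictionary-of-sets input format and the function name are assumptions made for this example.

```python
from collections import Counter


def filter_uncommon_domains(domains, cutoff=4):
    """Drop domains occurring in fewer than `cutoff` proteins.

    domains: dict mapping protein -> set of domains (hypothetical input format).
    Proteins left with no domains are removed; in the remaining proteins,
    uncommon domains are ignored.
    """
    counts = Counter(d for ds in domains.values() for d in set(ds))
    filtered = {}
    for protein, ds in domains.items():
        common = {d for d in ds if counts[d] >= cutoff}
        if common:
            filtered[protein] = common
    return filtered


domains = {"p1": {"A", "B"}, "p2": {"A"}, "p3": {"A"}, "p4": {"A"}, "p5": {"B"}}
print(filter_uncommon_domains(domains))
```

In the toy input, domain A occurs in four proteins and is kept, while domain B occurs in only two; protein p5 is therefore removed and B is ignored in p1.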

In the training set, we include an interaction only if we have retained both the interactors after these steps. Finally, we have a training set with 4,177 human PPIs spanning 2,196 human proteins, 9,384 fly PPIs spanning 3,864 fly proteins, and 127 Plasmodium PPIs spanning 120 Plasmodium proteins. We create the set of proteins for prediction by applying only steps (1), (2), and (4). We obtain a universe of 27,371 human proteins, 11,924 fly proteins, and 938 Plasmodium proteins on which to make predictions. We explore the use of other features for building training and testing data sets of PPIs (see Section 6.2.4).

Table 6.1: Distribution of the number of predicted host–pathogen PPIs for different ranges of probability.

PPI Probability    #human-Plasmodium PPIs    #fly-Plasmodium PPIs
0.50 – 0.55        185                       6
0.55 – 0.60        175                       15
0.60 – 0.65        31                        11
0.65 – 0.70        61                        12
0.70 – 0.75        16                        0
0.75 – 0.80        48                        0
Total              516                       44

We predict 516 human-Plasmodium PPIs with a probability of at least 0.50. These predictions involve 158 human proteins and 30 Plasmodium proteins. Figure 6.3 displays a layout of this network. We predict 44 fly-Plasmodium PPIs with a probability of at least 0.50. These predictions involve 29 fly proteins and 8 Plasmodium proteins. No malaria proteins participate in both predicted networks. Table 6.1 displays the number of PPIs we predict for different ranges of probabilities. The marked difference in the number of PPIs predicted for the two systems suggests that our methodology indeed identifies plausible interactions between host and pathogen proteins.

6.2.1 Triplet proximity

In the human-Plasmodium network, we use the phrase “H-H-P triplet” to refer to two human proteins predicted to interact with the same Plasmodium protein. Similarly, we use the phrase “H-P-P triplet” to refer to a human protein predicted to interact with two Plasmodium proteins. We compute the fraction of triplets such that the two human proteins are a distance k apart in the human PPI network, for different values of k ≥ 1. Note that this network contains all 39,207 interactions between human proteins. Figure 6.4 displays these distributions for H-H-P and H-P-P triplets.

Of the 158 human proteins predicted to interact with Plasmodium proteins, only 31 have known interactions in the human PPI network. There are a total of 582 H-H-P triplets and 31 H-P-P triplets. Figure 6.4(a) demonstrates that as many as 72% of human protein pairs in H-H-P triplets are at a distance of two or less in the human PPI network. Thus our predictions are likely to connect human proteins with functional relationships. The average distance between Plasmodium proteins in H-P-P triplets is 5.5 (see Figure 6.4(b)), probably because the Plasmodium PPI network is sparse and contains only 2,643 interactions. For the fly-Plasmodium predictions, we have 30 H-H-P triplets and 5 H-P-P triplets. These counts are too small for us to draw any conclusions.

6.2.2 Triplet correlation of gene expression

For each H-P-P triplet, we compute the Spearman’s correlation coefficient of the gene expression profiles of the two Plasmodium proteins in a DNA microarray dataset. We divide the range of the coefficient into bins, and plot the fraction of triplets that fall in each bin. Figure 6.5 displays these results.

Figure 6.3: A layout of the predicted human-Plasmodium PPI network. Blue circles are human proteins. Red diamonds are Plasmodium proteins. Solid grey edges are predicted PPIs.

Figure 6.4: Distributions of H-H-P and H-P-P intra-species distances for the triplet proximity analysis. Numbers in parentheses are the number of triplets. [Plot: x-axis, Intra-species Distance (1 to 9); y-axis, Fraction of Triplets; series: H-H-P (582), H-P-P (31).]

Combining the dataset of Bozdech et al. [31] with the predicted interactions yields 188 H-P-P triplets. We only consider triplets where both Plasmodium proteins have measured expression profiles. Figure 6.5(b) demonstrates that pairs of Plasmodium proteins that we predict to interact with the same human protein are co-expressed. We obtain similar results for the 387 H-P-P triplets yielded by combining the predicted interactions with the dataset of Le Roch et al. [134] (also Figure 6.5(b)). For these two gene expression data sets, we have 5 and 30 H-P-P triplets in the fly-Plasmodium network, respectively. These counts are too small for us to draw any conclusions.

When we integrate predicted H-H-P triplets with human gene expression data sets, we obtain coexpression distributions that show more variance than those for the H-P-P triplets. Figure 6.5(a) displays the results for the data of Ockenhouse et al. [168] and Boldt et al. Each distribution covers 392 triplets. A potential reason for this outcome is that the human proteins targeted by Plasmodium proteins trigger signaling cascades (e.g., an immune response) that control the expression of a different set of human proteins. This hypothesis is strengthened by the fact that when we restrict our attention to H-H-P triplets involving human proteins known to be localized to the cell surface or the plasma membrane, we obtain distributions similar to those in Figure 6.5(a) (data not shown).

Figure 6.5: Distributions of H-H-P and H-P-P Spearman’s correlations for the triplet co-expression analysis. Numbers in parentheses are the number of triplets. [Plots: x-axis, Spearman’s Coefficient (-1 to 1); y-axis, Fraction of Triplets. Panel (a), H-H-P triplets: Boldt (392), Ockenhouse (392). Panel (b), H-P-P triplets: Bozdech (188), Le Roch (387).]


Figure 6.6: Life cycle of the parasite Plasmodium falciparum.

6.2.3 Functionally enriched subnetworks

To compute pairs of enriched functions in the predicted networks, we generate 1,000,000 random sets of human and Plasmodium GO annotations. We discard all pairs of functions whose p-value is greater than 0.05. After collapsing the remaining enriched pairs, we remove any pairs in which at least one function has a depth less than three in the GO hierarchy. We measure the depth of a function as the length of the shortest path to the root of the category the function belongs to.

We identify 39 enriched pairs of GO functions in the predicted human-Plasmodium network and none in the fly-Plasmodium network after a Bonferroni correction. Ockenhouse et al. [168] report that genes up-regulated in individuals infected with malaria are enriched in 15 GO terms. In our analysis, we are able to identify ten of these functions in enriched pairs before collapsing: apoptosis, immune response, inflammatory response, intracellular protein transport, mitochondrion, nuclear mRNA splicing, protein folding, regulation of apoptosis, regulation of transcription, and ubiquitin cycle.

Before discussing the functionally enriched subnetworks in detail, we briefly review the life cycle of Plasmodium falciparum (see Figure 6.6). Malaria infection spreads by the parasite’s ability to infect human hosts and the Anopheles mosquito. When an infected mosquito bites a human, it injects malaria sporozoites into the host’s blood system. The parasites subsequently invade liver cells. Within hepatic cells, the parasite undergoes schizogony, or asexual reproduction, to form exo-erythrocytic merozoites. These merozoites are released into the blood stream where they invade host erythrocytes (red blood cells or RBCs). Once inside an erythrocyte, the parasite undergoes rapid multiplication. During this development, the parasite first metamorphoses into the ring stage. Upon further development, it becomes a trophozoite and begins feeding on the host hemoglobin. Subsequently, through asexual reproduction, a trophozoite forms a schizont that gives rise to several merozoites. These intracellular merozoites escape from the erythrocyte when it ruptures to subsequently infect new erythrocytes and continue this erythrocytic life cycle.

During its life within the host RBC, the malaria parasite has specialized mechanisms for causing physical changes to the RBC [97]. PfEMP1s are exported to the RBC surface, where they cause the RBC to become sticky and adhere to the endothelial lining in the capillaries and to additional uninfected RBCs through a process known as rosetting. The build-up results in circulatory blocking, which restricts the flow of oxygen.

Our analysis finds an enriched subnetwork between human proteins annotated with “blood coagulation” and malaria proteins annotated with “integral to membrane” (p-value 3 × 10^−6). This network includes predicted interactions between Q8IAS3, a known PfEMP1, and several human proteins involved in blood coagulation. One of the predicted interacting partners of Q8IAS3 is plasminogen (Q5TEH4), which is involved in the degradation of many blood plasma proteins. An important step in the release of malaria merozoites from infected erythrocytes is the activation of plasminogen [192]. Q8IAS3 is highly expressed during the erythrocytic life cycle [216]. Our analysis predicts that PfEMP1 might interact with host plasminogen to promote the degradation of RBCs. Additional predicted partners of Q8IAS3 include hepatocyte growth factors (HGFs). During the liver stages of the Plasmodium life cycle, there is a required induction of HGFs for hepatocyte invasion [40]. Thus, our prediction suggests that Q8IAS3 may also play a key role in hepatic cell invasion by triggering the activation of HGFs.

Platelets are tiny cells in the blood that are important for clotting. Other symptoms of malaria infection are bleeding disorders and thrombocytopenia, which is the presence of reduced platelet counts and dysfunctional platelets. Besides Q8IAS3, we predict that two additional proteins, Q8IAL6 and Q8I339, interact with human blood coagulation proteins. Both proteins are highly expressed throughout the erythrocytic life cycle [216]. These two proteins are labeled as hypothetical. Each contains a transmembrane domain, suggesting that they are localized to the Plasmodium cell surface. Baruch et al. [18] showed that mature parasitized RBCs have an affinity for thrombospondin, which is found in blood platelets. Our predictions suggest that these two proteins may play a role in disrupting human blood coagulation pathways. Given that these two proteins are predicted to interact with some of the partners of Q8IAS3, they are likely to be members of the malaria pathogenesis pathway.

Our functional enrichment analysis also identifies “subtilase activity” and “merozoite dense granule” as functions annotating Plasmodium proteins that interact with human proteins involved in “blood coagulation” (p-value < 1 × 10^−6). The dense granule is a specialized secretory organelle that excretes a subtilisin-like protease that plays an important role in RBC invasion and degradation [169, 240]. We predict that Q8IHZ5, a known subtilisin-like protease, interacts with a number of blood coagulation proteins, which suggests that it may also be involved in the degradation of blood platelets. Expression profiles of this protein show it is highly expressed during the sporozoite and merozoite stages as well as the later stages of the erythrocytic life cycle [216]. These events coincide with the stages where malaria enters the RBC and travels in the blood stream. As in the case of the pathogenesis proteins, we predict that a hypothetical Plasmodium protein Q8IKP8 interacts with the predicted partners of Q8IHZ5. These two proteins have a similar expression pattern [216] and share a high degree of sequence similarity (bit score of 2053) [6]. We predict that Q8IKP8 may also be a subtilisin-like protein.

6.2.4 Using different protein features to build training set

An alternative to using GO annotations in constructing training and testing datasets is to use protein features, e.g., signal peptides and transmembrane regions, given that secreted and membrane-bound pathogen proteins play an important role in pathogenesis [123, 138]. Here we explore the use of these features as markers for constructing training and testing PPI datasets. We construct seven different filters using different combinations of features from the SignalP (P), SecretomeP (S), and TMHMM (M) predictions: M, MP, MPS, MS, P, PS, and S. We compare the results obtained with these filters to those from the previous discussion, which used the Gene Ontology (G) functions. Additionally, we consider the set of proteins without applying any of the filters (ALL). We apply the complete prediction pipeline using each of these filters on both the training and testing steps. We also repeat each of these analyses applying the filters to only the testing step, while allowing all proteins in the training step. Finally, we explore the use of a post-prediction BLAST [6] filter. For any predicted interaction between a host protein H1 and a pathogen protein P1, we remove the interaction from the analysis if H1 is orthologous to a known interacting partner of P1 (respectively, P1 with an interacting partner of H1) in the intra-species protein-protein interaction network. Such predicted interactions may be a result of a conserved complex with the host and pathogen. We use INPARANOID [190] to identify orthologous human and P. falciparum proteins.

Figure 6.7: Number of predicted interactions resulting from the use of different filters during the training and testing steps. ALL is the set of all proteins, G is the set of proteins annotated with Gene Ontology functions of interest, M is the set of proteins containing transmembrane domains, P is the set of proteins containing a signal peptide, and S is the set of proteins predicted to be secreted that do not contain a signal peptide. The first nine groups were trained using all proteins, followed by the use of the filters, whereas the next eight groups were trained and tested on sets of filtered proteins. The dark-blue bars show results without the BLAST-Filter and the light-blue bars show results with the BLAST-Filter.
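The post-prediction filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `filter_conserved_complexes`, the dictionary layouts, and the toy identifiers are hypothetical, and the ortholog map stands in for whatever INPARANOID output format is actually used.

```python
def filter_conserved_complexes(predictions, host_ppi, pathogen_ppi, orthologs):
    """Drop predicted host-pathogen pairs that may reflect a conserved complex.

    predictions  : iterable of (host_protein, pathogen_protein) pairs
    host_ppi     : dict mapping a host protein to its intra-species partners
    pathogen_ppi : dict mapping a pathogen protein to its intra-species partners
    orthologs    : dict mapping a protein to the set of its orthologs in the
                   other species (hypothetical layout, assumed symmetric)
    """
    kept = []
    for host, pathogen in predictions:
        host_orthologs = orthologs.get(host, set())
        pathogen_orthologs = orthologs.get(pathogen, set())
        # H1 is orthologous to a known intra-species partner of P1.
        if host_orthologs & pathogen_ppi.get(pathogen, set()):
            continue
        # P1 is orthologous to a known intra-species partner of H1.
        if pathogen_orthologs & host_ppi.get(host, set()):
            continue
        kept.append((host, pathogen))
    return kept
```

With this layout, a prediction survives only if neither direction of the orthology test fires, which mirrors the "respectively" clause in the description.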

Figure 6.7 displays the number of predicted interactions for each of the analyses, both without and with the use of the post-prediction filtering process (dark-blue and light-blue, respectively). The number of resulting predictions varies greatly between the various analyses. The set with the most predictions is the set trained using all proteins and predicted on all proteins (ALL-ALL). For this set we predict 11,481 Protein-Protein Interactions (PPIs) without the post-prediction BLAST filter and 10,844 PPIs with post-prediction filtering. The dataset trained using all proteins and then predicted on proteins with signal peptides (ALL-P) contained the fewest predictions, 23 and 16 respectively. Considering each of the different sets we also compute enriched GO functions amongst the predicted PPIs (see Section 6.1). Figure 6.8 displays a distribution of the number of pairs of enriched functions for each of the analyses. Again we observe a wide range, from 3,903 pairs of enriched functions for the group trained and tested using all proteins (ALL-ALL) to 1 pair of enriched functions for the set trained using all proteins and tested on proteins with signal peptides (ALL-P).

To explore the similarity between these various analyses we compute the Jaccard coefficient of the predicted PPIs and of the enriched functions between all pairs of analyses. Figure 6.9 displays the Jaccard coefficients of the predicted interactions between the different datasets when no BLAST-Filtering is used. This image shows that there is little to no overlap between the various sets of predicted interactions. At the functional level we see the same pattern (see Figure 6.10). The corresponding images for the sets using the post-prediction BLAST-filter can be seen in Figures A.4 and A.5 of the appendix.
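For reference, the Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. The sketch below computes it for every pair of analyses; the function names and the choice to return 0.0 for two empty sets are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    # Convention assumed here: two empty sets have coefficient 0.0.
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(predictions_by_analysis):
    """Map each unordered pair of analysis names to its Jaccard coefficient."""
    return {
        (x, y): jaccard(predictions_by_analysis[x], predictions_by_analysis[y])
        for x, y in combinations(sorted(predictions_by_analysis), 2)
    }
```

A coefficient near 0 for most pairs corresponds to the "little to no overlap" pattern reported for the predicted interactions.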

The results suggest that the types of interactions predicted vary greatly given the constraints placed on the testing and training steps. Given the lack of a gold-standard dataset to test the various predictions it is not possible at this time to determine which feature or combinations of features are most informative. This will be an important extension of this work once such data become available.

6.3 Conclusions

Predicting interactions between host and pathogen is a relatively poorly studied problem with important implications in biomedicine. We have presented an algorithm that integrates protein domain-profiles with interactions between proteins from the same organism to predict interactions between host and pathogen proteins. When applied to the human-Plasmodium system, our method identifies several biologically important sub-networks that can act as the starting point for therapeutic development.

An important extension to our method is to incorporate reliability estimates of PPIs detected by high-throughput screens [220]. There are many Plasmodium proteins known to be important to the erythrocytic life cycle [11, 18, 26, 50, 97]. We predict many other interactors for both PfEMP1s and MSP1s but with probabilities less than 0.50. This observation suggests that integrating additional data sources into our system may enable us to predict more PPIs involved in malaria invasion of its host with increased confidence.

Figure 6.8: Number of enriched functions identified amongst predicted interactions resulting from the use of different filters during the training and testing steps. ALL is the set of all proteins, G is the set of proteins annotated with Gene Ontology functions of interest, M is the set of proteins containing transmembrane domains, P is the set of proteins containing a signal peptide, and S is the set of proteins predicted to be secreted that do not contain a signal peptide. The first nine groups were trained using all proteins, followed by the use of the filters, whereas the next eight groups were trained and tested on sets of filtered proteins. The dark-blue bars show results without the BLAST-Filter and the light-blue bars show results with the BLAST-Filter.

[Figure 6.9: heat map. Rows and columns list the same 17 analyses: ALL-ALL, ALL-G, ALL-M, ALL-MP, ALL-MPS, ALL-MS, ALL-P, ALL-PS, ALL-S, G, M, MP, MPS, MS, P, PS, S.]

Figure 6.9: Overlap of the predicted interactions resulting from the use of different filters during the training and testing steps. ALL is the set of all proteins, G is the set of proteins annotated with Gene Ontology functions of interest, M is the set of proteins containing transmembrane domains, P is the set of proteins containing a signal peptide, and S is the set of proteins predicted to be secreted that do not contain a signal peptide. The first nine groups were trained using all proteins, followed by the use of the filters, whereas the next eight groups were trained and tested on sets of filtered proteins. Each cell is the Jaccard coefficient of the predicted interactions between the corresponding row and column datasets. The Jaccard values range from 0 (red) through 0.5 (black) to 1 (green).

[Figure 6.10: heat map. Rows and columns list the same 17 analyses: ALL-ALL, ALL-G, ALL-M, ALL-MP, ALL-MPS, ALL-MS, ALL-P, ALL-PS, ALL-S, G, M, MP, MPS, MS, P, PS, S.]

Figure 6.10: Overlap of the enriched functions amongst the predicted interactions resulting from the use of different filters during the training and testing steps. ALL is the set of all proteins, G is the set of proteins annotated with Gene Ontology functions of interest, M is the set of proteins containing transmembrane domains, P is the set of proteins containing a signal peptide, and S is the set of proteins predicted to be secreted that do not contain a signal peptide. The first nine groups were trained using all proteins, followed by the use of the filters, whereas the next eight groups were trained and tested on sets of filtered proteins. Each cell is the Jaccard coefficient of the enriched functions between the corresponding row and column datasets. The Jaccard values range from 0 (red) through 0.5 (black) to 1 (green).

Chapter 7

Supervised Learning and Prediction of Host-Pathogen PPIs

HIV is a retrovirus that can lead to a failure of the immune system and kills millions of people yearly. Despite HIV being a dangerous infectious disease, there is currently no effective vaccine for it. HIV’s ability to rapidly mutate has resulted in several drugs becoming obsolete. A potentially powerful application of Protein-Protein Interaction (PPI) networks lies in using them to obtain insights into the molecular mechanisms underlying infectious diseases, especially since interactions between pathogen proteins and host proteins play key roles in initiating and sustaining infection. In Chapter 3, we surveyed the landscape of human proteins that interact with viruses and other pathogens [64]. We collected host-pathogen PPIs from seven public databases. Apart from strains of HIV and four other viruses, we found that for every other pathogen, at most 100 PPIs are currently known between proteins in that pathogen and human proteins. Therefore, the severe lack of large-scale datasets detailing interactions between host and pathogen proteins is a significant hurdle to progress in host-pathogen systems biology. Consequently, it becomes imperative to develop computational methods that can robustly and accurately predict host-pathogen PPIs. Such predictors can guide cost-effective experimental strategies to detect host-pathogen PPIs, drive research on how pathogens infect host cells, and help identify potential targets for therapeutics.

While a number of methods have been proposed for predicting PPIs, they have primarily focused on intra-species PPIs [113, 167, 176, 186, 204, 211, 244, 247]. Applying these methods to host-pathogen systems is made difficult by two factors. First, as we have already noted, experimental studies on most human pathogens have so far detected very small numbers of PPIs, making it difficult to build comprehensive training sets. Second, a number of data types used as features by previous methods, such as gene expression and knockout phenotypes [113, 186, 247], are not available for host-pathogen systems.

In this chapter, we exploit publicly-available human-HIV PPIs to build the first supervised predictor for HP-PPIs. The HP-PPIs come from a number of small-scale experiments and manually curated data. We use these data to train a Support Vector Machine (SVM) classifier using different combinations of features, including domain profiles, protein properties such as membrane localization, frequencies of protein sequence k-mers, and network characteristics of the human proteins in a human PPI network. Using four-fold cross validation, we compare the performance of a linear SVM kernel on different combinations of features. We find that using protein sequence four-mers, protein domains, and PPI network information achieves the best performance, with precision ranging between 64.91% and 91.70% and recall between 41.61% and 93.89%, depending on the ratio of true positives (TPs) to true negatives (TNs) used during training.

We use this predictor to identify potential viral interacting partners for human proteins and focus our discussion on those human proteins, and corresponding predicted interactors, known to play an important role in HIV infection [33]. Examples of such predicted interactions illustrate how the virus has evolved to manipulate host-cellular proteins to carry out a successful pathogenesis. For example, we predict interactions with membrane-bound proteins such as CXCR4 and CD4, and with many nuclear pore proteins. These interactions are known to play a critical role in the initial invasion of the cell and subsequent movement of viral material across the nuclear membrane. We discuss in depth these and other predicted HP-PPIs.

Finally, we explored the possibility of using a classifier trained on PPIs from the human-HIV system to predict PPIs in other human-virus systems, namely Hepatitis C virus, Herpesvirus, Influenza virus, and Papillomavirus. Depending on the viral system and TP:TN ratio used in training, we found that we were able to achieve precision and recall levels > 60%.

7.1 Methods

We first describe the classifier we use to predict human-HIV HP-PPIs. Next, we present the features we include in this classifier. Finally, we describe our experimental protocol.

7.1.1 Support Vector Machines

The SVM is a powerful and popular approach in machine learning for solving two-class pattern recognition problems. Given the training set S with each vector in S associated with a label equal to 1 or -1, an SVM classifier computes a plane separating the vectors in S with label 1 from the vectors with label -1, optionally after projecting the vectors to a higher-dimensional feature space. An important feature of SVMs is that the separating plane has maximum margin, where the margin is the distance from the separating plane to the closest vector.

In this study, we use a Support Vector Machine (SVM) classifier with a linear kernel. For each host-pathogen protein pair (p, q), we compute a feature vector f(p,q), as explained in the next section. Let S be a training set consisting of (f(p,q), l) pairs, where l ∈ {−1, 1} is the class label of the PPI (p, q). In our case, the labels 1 and −1 correspond to the classes “PPI” and “non-PPI,” respectively. We use the training set S as input to the SVM classifier. We use the SVMLight package [116] for training and testing SVMs. For different combinations of features, we systematically vary the parameter C, which controls the trade-off between maximizing the margin of the separating plane and minimizing the mis-classification error, to identify the value that yields the maximum accuracy using a four-fold cross validation. For each feature combination, we base further analyses and predictions on the optimal value of C for that combination.
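The role of C is easiest to see in the standard soft-margin formulation of a linear SVM, given here as a textbook reference rather than as a statement of SVMLight's exact internal objective. The symbols w (normal vector of the separating plane), b (offset), and ξ_i (slack for the i-th training pair) are introduced for illustration; f_i and l_i denote the feature vector and label of the i-th pair in S.

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i=1}^{|S|} \xi_i
\qquad \text{subject to} \qquad
l_i \left( w \cdot f_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, |S|
```

A small C tolerates more margin violations in exchange for a wider margin, whereas a large C penalizes mis-classified training pairs more heavily.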

7.1.2 PPI features

We consider four types of protein features in this study: domains (D), structural features (F), protein sequence k-mers (K), and properties in the intra-species human PPI network (N). We explain the rationale for including each of these features below.

Domains (D) Physical interactions between proteins are often mediated by specific domains [174]. For example, the Duffy-binding-like domain of the erythrocyte membrane proteins in Plasmodium falciparum plays an important role in infection by interacting with the human complement receptor 1 on uninfected red blood cells (RBCs) [193]. This interaction allows the parasite to bind to the host RBCs and subsequently invade them. Previous research has demonstrated the utility of protein-domain information in predicting both intra-species PPIs [167, 211] and host-pathogen PPIs [63].

Let Mp be the set of domains present in a protein p and let M be the set of all domains, computed over all proteins in our dataset. Our feature vector contains one binary feature for every pair of domains in M × M. For a PPI (p, q), we set the features corresponding to the domain pairs in Mp × Mq to be 1 and the remaining features to be 0. We encode the domain features using pairs of domains since the interaction is often contingent on the presence of the pair. An alternative method that we considered for incorporating protein domains was to compute from the training data the probability that a protein pair would contain a pair of domains given they interact. We refrain from using this approach because these probabilities cannot be computed accurately for the currently sparse human-HIV PPI datasets.
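The domain-pair encoding can be sketched as follows. The function names and the dense list representation are illustrative assumptions; with thousands of domains, a sparse representation (storing only the indices set to 1) would be the practical choice.

```python
from itertools import product


def build_domain_pair_index(all_domains):
    """Assign one feature position to every ordered pair in M x M."""
    domains = sorted(all_domains)
    return {pair: i for i, pair in enumerate(product(domains, domains))}


def domain_pair_features(domains_p, domains_q, pair_index):
    """Binary vector with a 1 for every domain pair in Mp x Mq, 0 elsewhere."""
    vector = [0] * len(pair_index)
    for pair in product(domains_p, domains_q):
        vector[pair_index[pair]] = 1
    return vector
```

For example, with M = {A, B}, a host protein containing only A and a pathogen protein containing only B set exactly one of the four features, the one for the pair (A, B).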

Structural features (F) Secreted and membrane-bound pathogen proteins play an important role in pathogenesis as such proteins are often critical for initial infection of the host cell [123, 138]. We deem that a human or pathogen protein is secreted if it is so identified by either the SignalP [22] or the SecretomeP [21] algorithms. To identify membrane-bound proteins, we use the TMHMM system [129]. We include one binary feature for each of these properties.

Protein sequence k-mers (K) Since the sequence of a protein determines its structure and consequently its function, it may be possible to predict PPIs using the amino acid sequence of a protein pair. Recently, Shen et al. [206] introduced the “conjoint triad model” for predicting PPIs using only amino acid sequences. Shen et al. partitioned the twenty amino acids into seven classes based on their electrostatic and hydrophobic properties (see supplementary Table 1 of Shen et al. [206]). For each protein, they computed all three-mers (sets of three consecutive amino acids) that occur in the protein’s sequence. They counted the number of times each distinct three-mer occurred in the sequence. Since larger proteins contain more three-mers and may not be directly comparable to smaller proteins, they normalized these counts by linearly transforming them to lie between 0 and 1 (see Shen et al. [206] for details). They represented the protein with a 343-element feature vector, where the value of each feature is the normalized count for each of the 343 (7^3) possible amino acid three-mers. In this chapter we explore the use of two-, three-, four-, and five-mers using the same seven classes suggested by Shen et al. [206]. For each host-pathogen protein pair, we concatenate the feature vectors of the individual proteins.
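A minimal sketch of the reduced-alphabet k-mer features follows. Two points are assumptions rather than facts taken from this chapter: the seven-class partition below is the grouping as commonly quoted from Shen et al. and should be checked against their supplementary Table 1, and the rescaling used here is a plain min-max transform, which may differ in detail from the linear transformation Shen et al. apply. Windows containing residues outside the twenty standard amino acids are skipped, which is also a choice made only for this example.

```python
from itertools import product

# Assumed seven-class partition of the amino acids (verify against Shen et al.).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: str(i) for i, group in enumerate(CLASSES) for aa in group}


def kmer_features(sequence, k=4):
    """Normalized counts of all 7**k class k-mers in one protein sequence."""
    keys = ["".join(t) for t in product("0123456", repeat=k)]
    counts = dict.fromkeys(keys, 0)
    encoded = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    for i in range(len(encoded) - k + 1):
        window = encoded[i:i + k]
        if None in window:  # skip windows with non-standard residues
            continue
        counts["".join(window)] += 1
    values = [counts[key] for key in keys]
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)
    # Min-max rescaling to [0, 1] (an assumed form of the normalization).
    return [(v - low) / (high - low) for v in values]


def pair_features(host_sequence, pathogen_sequence, k=4):
    """Concatenate the k-mer vectors of the two proteins in a pair."""
    return kmer_features(host_sequence, k) + kmer_features(pathogen_sequence, k)
```

With k = 3 each protein yields 343 features, and with k = 4 it yields 2,401, so a four-mer pair vector has 4,802 entries before the domain and network features are added.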

Network properties (N) Recent studies have suggested that pathogens have evolved to interact with human proteins which are hubs (proteins with many interacting partners) [36, 64] and bottlenecks (proteins that are central to many paths in the network) [64] in the human PPI network (see also Chapters 3 and 4). We represent the human PPI network as a graph G(V,E), where V is the set of human proteins and E is the set of PPIs between them. We define the degree of a protein in a PPI network as the number of interactions in which it participates, not including self-interactions. We define the betweenness centrality bc(v) of a protein v as the fraction of shortest paths in G between all protein pairs (u, w) that pass through the protein v. Given u, v, w ∈ V, let σ_uw denote the number of shortest paths between proteins u and w. There may be multiple equally long paths between u and w that are shorter than any other path between u and w. Let σ_uw(v) denote the number of these that pass through v. Then the betweenness centrality of v is

\[
bc(v) = \sum_{\substack{u, w \in V \\ u, w \neq v}} \frac{\sigma_{uw}(v)}{\sigma_{uw}}
\]

In our analysis, we divide bc(v) by the number of pairs of nodes in G, yielding a quantity between 0 and 1. We use the algorithm devised by Brandes [32] to compute the betweenness centrality of all nodes in G. We include two features for these properties: an integer-valued feature for a human protein’s degree and a real-valued feature for its betweenness centrality.
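The two network features can be sketched with a compact version of Brandes' algorithm for unweighted, undirected graphs. This is an illustrative implementation, not the code used in this study. It assumes that the graph is a dictionary mapping every protein to the set of its partners (with every partner also present as a key), and that "the number of pairs of nodes" means all n(n − 1) ordered pairs, which matches a sum that visits each unordered pair from both endpoints.

```python
from collections import deque


def degree(adj, v):
    """Number of distinct interaction partners of v, ignoring self-interactions."""
    return len(set(adj[v]) - {v})


def betweenness_centrality(adj):
    """Normalized betweenness of every node (Brandes' accumulation scheme)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in set(adj[v]) - {v}:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies in order of decreasing distance.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    ordered_pairs = n * (n - 1)
    return {v: (b / ordered_pairs if ordered_pairs else 0.0) for v, b in bc.items()}
```

On a three-protein chain a, b, c, only b lies on a shortest path between two other proteins, so it receives a normalized betweenness of 1/3 while a and c receive 0.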

7.1.3 Evaluation of performance

We test the predictive power of twelve combinations of features: D, DF, DFK, DFKN, DFN, DK, DKN, DN, FK, FKN, K, and KN. We do not test the predictive power of the combinations F, N, and FN because the coverage of these features is small. We use a four-fold cross-validation procedure to test the performance of the SVM. We count the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) and compute the precision (TP / (TP + FP)) and recall (TP / (TP + FN)) of each feature combination. We use these data to plot precision/recall curves. We also use accuracy ((TP + TN) / (TP + FP + TN + FN)) as a criterion for comparing the different feature sets and selecting the optimal value of the SVM parameter C.
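The three measures can be computed directly from the confusion counts. The helper names below are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count TP, FP, TN, FN for labels in {1, -1} (1 = PPI, -1 = non-PPI)."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == 1:
            if truth == 1:
                tp += 1
            else:
                fp += 1
        else:
            if truth == 1:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def precision_recall_accuracy(tp, fp, tn, fn):
    """Precision, recall, and accuracy; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy
```

Note that at skewed TP:TN ratios accuracy is dominated by the true negatives, which is one reason to inspect the precision/recall curves alongside it.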

7.1.4 Datasets used

All data used in this study was downloaded in February 2008.

Genomic information

We used the Uniprot database [11] as a source for protein sequence information. We used InterProScan [187] as our method for determining protein domains. We used SignalP [22] to predict secreted proteins, SecretomeP [21] to predict non-classically secreted (i.e., no signal peptide) proteins, and TMHMM [129] to predict proteins that contain transmembrane alpha-helices.

Gold standard datasets

Positive examples We gathered 1,028 human-HIV PPIs from four public databases: the Biomolecular Interaction Network Database [80], the Database of Interacting Proteins [196], IntAct [96], and Reactome [117]. From these databases we also gathered 1,127 human-Hepatitis C, 284 human-Influenza virus, 177 human-Herpesvirus, and 97 human-Papillomavirus PPIs for use in the analysis of cross-prediction using a model trained on human-HIV PPIs. We also constructed a human intra-species PPI network containing 78,804 PPIs using these four databases along with three additional ones: the Human Protein Reference Database [159], the Molecular INTeraction database [246], and the Munich Information Center for Protein Sequences [86]. This intra-species network was used to compute each human protein’s degree and centrality measurements.

Negative examples Unlike positive examples, true negative examples are difficult to define since it is rare to find confirmed lists of non-interacting protein pairs. Since the number of truly interacting pairs is likely to be far less than the total set of protein pairs, we generate negative examples by randomly pairing H. sapiens and HIV proteins, after removing positive examples. We use the same procedure to generate negative examples in the other human-viral systems. In this study we explore the use of different true-positive to true-negative (TP:TN) ratios: 1:1, 1:10, 1:25, 1:50, and 1:100. While it is possible that the correct TP:TN ratio may be lower than 1:100, exploring a range of TP:TN ratios allowed us to observe performance trends and guided our choice of other parameters. Additionally, we use lower ratios because human-pathogen PPI data is so sparse when compared to model organisms, resulting in a higher probability of labeling a true interaction that has not been observed as non-interacting.
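The negative-sampling step can be sketched as below. The function name, the fixed random seed, and the choice to enumerate all candidate pairs before sampling are assumptions for illustration; enumeration is feasible here because the viral proteome is small.

```python
import random


def sample_negatives(host_proteins, pathogen_proteins, positives,
                     negatives_per_positive, seed=0):
    """Randomly pair host and pathogen proteins, excluding known positives.

    negatives_per_positive is the N in a 1:N TP:TN ratio.
    """
    positives = set(positives)
    candidates = [
        (h, p)
        for h in sorted(host_proteins)
        for p in sorted(pathogen_proteins)
        if (h, p) not in positives
    ]
    wanted = min(negatives_per_positive * len(positives), len(candidates))
    # Sampling without replacement keeps the negative examples distinct.
    return random.Random(seed).sample(candidates, wanted)
```

With 1,028 positive examples, a 1:25 ratio corresponds to 25,700 sampled negative pairs.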

7.2 Results

Our analysis contains three components. First, we compare the SVM results using different feature combinations to identify which subset of features achieves the best performance. Second, using this set of features, we predict PPIs between human and HIV proteins. Third, we explore using human-HIV PPIs to train a classifier and predict other human-viral PPIs.

7.2.1 Predictive feature sets

Using the features previously described in section 7.1.2, we compare and contrast the ability of different sets of features to predict human-HIV PPIs.

k-mer size

First we identify which size k-mer performs the best. Figures 7.1(a)–7.1(e) display the precision/recall curves for each of the TP:TN ratios and for four different k-mer sizes: two, three, four, and five (see Table B.8 of the appendix for the corresponding accuracy values). We observe that all k-mer sizes performed well at a TP:TN ratio of 1:1. However, increasing the ratio, even to 1:10, quickly decreases the performance of the two-mer and three-mer models. For each of the TP:TN ratios, the four-mers and five-mers perform the best. While the five-mer model has better precision at recall values < 0.25, the four-mer model seems to perform the best overall. The Area Under the Curve (AUC) values of the precision/recall curves for the four-mer model ranged from 0.18 to 0.89. It is possible that four-mers in this training set represent a statistical optimum that balances too much noise (two-mers and three-mers) against over-fitting (five-mers). Thus we use four-mers in all future analyses described in this manuscript.

[Figure 7.1, panels (a) 1:1, (b) 1:10, (c) 1:25, (d) 1:50, and (e) 1:100: precision/recall curves with recall on the horizontal axis and precision on the vertical axis (both from 0 to 1) for 2-mers, 3-mers, 4-mers, and 5-mers.]

Figure 7.1: Precision/recall curves using different k-mer sizes for different TP:TN ratios.

Feature combination

Figures 7.2(a)–7.2(e) display the precision/recall curves for each of the TP:TN ratios. In order to make these images more clear, we do not include the six combinations of features involving the structural features (F), because subsequent analyses showed these features did not improve the predictive power. We do include the corresponding accuracy values for all twelve feature sets in Table B.8 of the appendix. At the lowest TP:TN ratio of 1:1 we find that all feature combinations perform well with precision and recall > 0.80, with the exception of the D and DF feature sets. As we increase the number of TNs used in training we begin to see a general trend that the DKN feature combination seems to perform the best, with precision and recall values > 0.70 at the 1:10, 1:25, and 1:50 ratios and precision and recall values > 0.55 at the 1:100 ratio. The AUC values of the precision/recall curves for the DKN model range from 0.50 to 0.97.

[Figure 7.2, panels (a) 1:1, (b) 1:10, (c) 1:25, (d) 1:50, and (e) 1:100: precision/recall curves with recall on the horizontal axis and precision on the vertical axis (both from 0 to 1) for the feature combinations D, DK, DKN, DN, K, and KN.]

Figure 7.2: Precision/recall curves using different feature combinations for different TP:TN ratios.

7.2.2 Feature importance

An advantage of machine learning methods is the ability to identify those features that are best able to predict PPIs. In the case of linear SVMs, we can use the resulting support vectors of the decision plane to compute a feature’s importance. For each TP:TN ratio, we construct the SVM model corresponding to the DKN feature set and compute each feature’s importance. Several of the domain pairs appear in the top ten features of all the different TP:TN ratios. Examples of these include the human domain “Clathrin adaptor (IPR000804)” and the HIV domain “HIV negative factor Nef (IPR001558)”. The viral Nef protein has been shown to play an important role in disrupting the AP2M1 clathrin adapter pathway by inducing the formation of clathrin-coated pits in the presence of CD4 in an effort to accelerate the rate of endocytosis [72, 221]. Another example is the domain pair consisting of the human domain “Four-helical cytokine, core (IPR012351)” and the HIV domain “HIV transactivating regulatory protein Tat (IPR001831)”. Cytokines are a class of proteins that are used extensively in cellular communication and signaling. The viral protein Tat is known to play a critical role in the disruption of normal cell signaling pathways such as the apoptotic pathway [48]. While no direct support was identified as to whether these particular interactions are mediated by the domains, these important features can act as the basis for future studies of how HIV proteins interact with human proteins.
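The text does not spell out the importance measure, so the following is only one common way to derive it for a linear SVM: reconstruct the weight vector of the separating plane as the sum of the support vectors scaled by their coefficients and labels, then rank features by the absolute value of their weight. The function names and input layout are hypothetical and do not reflect the SVMLight model file format.

```python
def linear_svm_weights(support_vectors, alphas, labels):
    """Weight vector w = sum_i alpha_i * l_i * x_i of a linear SVM."""
    weights = [0.0] * len(support_vectors[0])
    for vector, alpha, label in zip(support_vectors, alphas, labels):
        for j, value in enumerate(vector):
            weights[j] += alpha * label * value
    return weights


def rank_features(weights, names, top=10):
    """Feature names with their weights, sorted by decreasing |weight|."""
    ranked = sorted(zip(names, weights), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:top]
```

Under this measure a large positive weight pushes a pair toward the "PPI" class and a large negative weight pushes it toward "non-PPI", so the sign is worth inspecting as well as the magnitude.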

7.2.3 Literature-based validation of predicted PPIs

For each of the TP:TN ratios we use a model trained using domains, protein k-mers, and network properties (DKN) to predict PPIs. Table 7.1 summarizes the number of predictions for each of the TP:TN ratios. Depending on the TP:TN ratio used in training we make between 610 (1:100) and 10,994 (1:1) predictions. Amongst these predictions are between 41.6% (1:100) and 98.9% (1:1) of our known TP PPIs. In the rest of this section, we discuss results for the 1:25 ratio. All of the predicted interactions are available in the supplementary material.

We focus our discussion here on PPIs containing human proteins that have been recently identified as being critical for the pathogenesis of HIV [33]. Brass et al. [33] used a large-scale small interfering RNA screen to identify more than 250 such HIV Dependency Factors (HDFs). HDFs may or may not interact directly with HIV proteins. This may be one reason why we are not able to identify interacting partners for many HDFs. However, predicting physical interactions between HDFs and HIV proteins may help guide experimental studies of how HDFs are vital for HIV pathogenesis. While the HDFs were identified primarily in HeLa cells, with only a few being validated in lymphocytes, they provide a great starting point for predicting PPIs involving critical human proteins. Our training set contained 74 known human-HIV PPIs involving HDFs. Using the DKN model we are able to recover 82.4% of these interactions. In addition to those PPIs within our training set, we are able to predict an additional 46 PPIs involving HDFs (see Table 7.1). Below we discuss predicted interactions that illustrate HIV’s progression as it interacts with the host cell membrane to gain entrance into the cell, its subsequent interactions with human cytoplasmic proteins, and its eventual entrance into the host nucleus (see Figure 7.3).

Table 7.1: Summary of the number of predicted interactions for the different TP:TN ratios using the DKN model. The dataset contained 1,028 TP PPIs, of which 74 involve HIV dependency factors.

TP:TN ratio                1:1        1:10       1:25       1:50       1:100
# predicted PPIs           10,994     3,932      1,856      1,118      610
# TP predicted PPIs        1,017      863        745        612        428
                           (98.9%)    (83.9%)    (72.5%)    (59.5%)    (41.6%)
# known HDF PPIs           73         70         61         54         47
                           (98.6%)    (94.6%)    (82.4%)    (73.0%)    (63.5%)
# new predicted HDF PPIs   224        86         46         33         16
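The percentages reported in Table 7.1 follow directly from the counts in the table and its caption. The short check below reproduces them; the variable names are chosen for this illustration only.

```python
known_tp = 1028   # TP PPIs in the dataset (Table 7.1 caption)
known_hdf = 74    # TP PPIs involving HDFs (Table 7.1 caption)

ratios = ["1:1", "1:10", "1:25", "1:50", "1:100"]
tp_recovered = [1017, 863, 745, 612, 428]   # row "# TP predicted PPIs"
hdf_recovered = [73, 70, 61, 54, 47]        # row "# known HDF PPIs"


def percent(part, whole):
    """Percentage of `whole` covered by `part`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


tp_recall = {r: percent(n, known_tp) for r, n in zip(ratios, tp_recovered)}
hdf_recall = {r: percent(n, known_hdf) for r, n in zip(ratios, hdf_recovered)}
```

At the 1:25 ratio discussed in this section, `tp_recall["1:25"]` is 72.5 and `hdf_recall["1:25"]` is 82.4, matching the values quoted in the text.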

CD4-positive helper T cells are the primary targets of HIV. Helper T cells are responsible for activating and directing other immune cells, but they lack cytotoxic and phagocytic activities themselves, meaning they cannot kill infected cells or pathogens directly. HIV has evolved to make use of host membrane proteins involved in the immune response, such as CD4 and CXCR4, to gain entrance into the cell. HIV attaches to the host protein CD4, a T-cell glycoprotein, and subsequently to host chemokine receptors such as CXCR4. These binding events cause conformational changes to the host proteins that allow the membrane of the virus to fuse to the host cell membrane [102]. We make several predictions for both the CD4 and CXCR4 proteins that are supported by the literature. Predicted interacting partners for the host CD4 protein include the viral proteins Env, Vpu, and Nef. The physical attachment of the virus to the host cell is mediated by the viral Env protein [198]. In the case of Vpu and Nef, the interactions have been linked to a depletion of CD4 proteins on the cell surface [43]. Although the direct mechanism is not clear, it has been shown that downregulation of CD4 is required for HIV infection [224]. One possible strategy of the virus in depleting the CD4 levels is the downstream effect of blocking apoptotic pathways [12]. Examples of predicted interacting partners for the host CXCR4 protein include the viral Tat and Env proteins.

Figure 7.3: Image taken with permission from Brass et al. [33] and adapted to show: 1) HDFs for which we make no predictions (red), 2) HDFs for which we make predictions using a 1:1 or 1:10 ratio in training (blue), and 3) HDFs for which we make predictions using a 1:25 ratio or higher during training (green).

These predictions are supported by direct experimental evidence [3, 13]. In the case of Env, this interaction is critical for viral fusion with the host cell [13].

We make predictions for other host proteins known to be found on the cell surface and to participate in various signaling pathways. For example, we predict interactions with the host HDFs Epidermal Growth Factor (EGF) and EGF Receptor (EGFR). These proteins play a critical role in the regulation of cell growth, proliferation, and differentiation. In particular, the EGF-Tat and EGFR-Gag interactions have been identified as key players in the manipulation of the EGF pathway [163, 234]. Both of these interactions were predicted by our algorithm. Finally, we predict that the host AP2M1 protein, a clathrin assembly protein, interacts with the viral Nef protein. Clathrin is a protein that is found in the clathrin-coated pits and vesicles formed during endocytosis of materials at the surface of the cell. This interaction has been shown to accelerate the rate of endocytosis via the AP2M1 clathrin adaptor mediated pathway [221].

HIV infection leads to low levels of CD4+ T cells. One method for reducing the number of CD4+ cells is via apoptotic programming. Reduction in the number of CD4+ cells weakens the host’s immune system and makes it more susceptible to infections. An example of an HDF involved in the regulation of cellular apoptosis is the host AKT1 protein. AKT1 is a general protein kinase known to phosphorylate and mediate the effects of several growth factors including EGF. AKT1 also mediates the anti-apoptotic effects of Insulin-like growth factor. One of the predicted interactors of AKT1 is the viral Tat protein. Recently it has been shown that the viral Tat protein interacts with and activates AKT1 to promote cellular survival and persistent infection [46].

Also localized to the cytoplasm and predicted to interact with the HIV Tat protein are the human HDFs PPP2R2A and PSME2. PPP2R2A is one of four major Ser/Thr phosphatases that plays a role in the negative control of cell growth and division. During pathogenesis, the interaction between PPP2R2A and Tat has been observed to play a key role in Tat’s ability to act as a transcription factor in the increased production of viral material [195]. Host PSME2 is a subunit of the protein complex responsible for activating the proteasome complex and enhancing the generation of major histocompatibility complex class I binding peptides. Viral Tat has been shown to interfere with antigen presentation via this interaction [105, 200]. This leads to a failure of the human immune system to recognize HIV-infected cells.

Since viruses lack the machinery needed to replicate their genomes, viral genetic material must first cross the barrier from the cytoplasm into the nucleus in order to make use of the host’s transcriptional machinery. The nuclear pore is a large protein complex that spans the nuclear membrane and allows for the transport of molecules, including proteins and RNA, across the nuclear envelope. We predict interactions between several viral proteins and host HDFs that are known to be part of the nuclear pore, including NUP107, NUP133, NUP153, NUP155, and NUP160. One of the predicted interactors is the viral Vpr protein. Viral Vpr has been shown to localize at the nuclear envelope and interact with several nuclear proteins [135]. This interaction has been linked with Vpr’s ability to drive the cell into G2 cell cycle arrest, resulting in the activation of apoptotic pathways [8]. We also predict these host HDFs to interact with the viral Tat protein. While no direct interaction has been observed between viral Tat and these nuclear pore proteins, Tat has been shown to possess a Nuclear Localization Sequence (NLS) and is capable of transporting material across the nuclear membrane through the nuclear pore [66]. Additionally, recent studies have shown a significant change in the expression of various proteins found at the nuclear pore in response to HIV infection [42]. Thus, the predicted interactions involving the viral Tat protein and these host HDFs may be worthy candidates for experimental validation. In addition to Vpr and Tat, we also predict the viral Pol as an interactor of these host HDFs. The C-terminal region of the HIV Pol protein encodes the viral integrase, an enzyme produced by retroviruses (including HIV) that enables viral genetic material to be integrated into the DNA of the infected cell. The viral integrase protein also contains an atypical NLS, and the import of this protein into the host nucleus is required prior to integration into the host chromosome [75].

Within the nucleus of the host cell, we make several predictions involving host HDF proteins. It seems that one possible goal of these predicted interactions is to modulate and manipulate host immune response pathways in order to assure continued survival of the virus. One example that highlights this strategy is the predicted interaction between the host HDF RELA and the viral Vpr protein. RELA is part of the NF-κB complex. NF-κB is a transcription factor that regulates many biological processes such as inflammation, immunity, and apoptosis. The interaction with Vpr has been shown to inhibit the nuclear translocation of NF-κB, thus preventing the host from mounting a successful immune response [236].

HIV also makes use of host proteins to drive the expression of its own genetic material. One such example is the host CCNT1 protein, a cyclin T1. Cyclins function as regulators of cyclin-dependent kinases, which play an important role in cell cycle progression. Two of the predicted partners of CCNT1 are the viral proteins Tat and Vpr. The interaction of CCNT1 and Tat has been shown to increase Tat’s affinity for the transactivating response RNA element and to serve as an essential co-factor for Tat, allowing the transcription of viral genes [25]. At the same time, the viral Vpr protein has been shown to interact with CCNT1 in tandem with viral Tat to modulate transcription of the viral genome [199]. HIV must also recruit host polymerases to transcribe viral genetic material. We predict an interaction between the human POLR3A, a DNA-dependent RNA polymerase, and the viral Tat protein. The HIV Tat protein has been shown to upregulate transcription by POLR3A, leading to an increased production of viral proteins [111].

7.2.4 Transferring host-pathogen PPIs

Given the large number of viruses that infect humans, it is not feasible to systematically identify PPIs experimentally for each system. We assess how well a classifier trained using only human-HIV interactions can predict known PPIs in other human-viral systems. A positive assessment sets the stage for systematic approaches to predict PPIs across host-pathogen systems, much like how intra-species PPIs have been transferred across model organisms [244].

We ask whether an SVM trained using the DKN feature set and the entire set of human-HIV PPIs can correctly predict PPIs in other human-viral systems. Figures 7.4(a)–7.4(e) display the precision/recall curves for each of the TP:TN ratios (see Table B.9 of the appendix for the corresponding accuracy measurements). At the 1:1 ratio, we observe excellent performance for all four pathogens, with precision and recall both ≥ 70%. As we increase the proportion of TNs, performance on the Herpesvirus and Papillomavirus datasets decreases rapidly. Classification of the Hepatitis C virus and Influenza virus datasets performs consistently well for all the ratios, with 60% precision at 50% recall on the Hepatitis C dataset and 85% precision at 50% recall on the Influenza dataset for the 1:25 TP:TN ratio.
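A precision/recall curve of the kind discussed above can be traced by ranking candidate pairs by classifier score and recording precision and recall after each prediction. The following is a minimal sketch under that assumption; the function name and toy scores are hypothetical, and tied scores are not treated specially.

```python
def precision_recall_points(scores, labels):
    """Return (recall, precision) after each prediction, by decreasing score.

    labels: 1 for a known interaction, 0 for a presumed non-interaction.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        recall = true_positives / total_positives
        precision = true_positives / rank
        points.append((recall, precision))
    return points


# Toy example (hypothetical scores): five candidate pairs, three known PPIs.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
curve = precision_recall_points(scores, labels)
```

A statement such as "60% precision at 50% recall" corresponds to reading the precision value at the first point on this curve whose recall reaches 0.5.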

[Figure 7.4: five precision/recall plots titled “Precision/Recall - HIV Trained”, one per TP:TN ratio: (a) 1:1, (b) 1:10, (c) 1:25, (d) 1:50, (e) 1:100. x-axis: Recall (0 to 1); y-axis: Precision (0 to 1); curves: Hepatitis C virus, Herpesvirus, Influenza virus, Papillomavirus.]

Figure 7.4: Precision/Recall of a DKN model trained using human-HIV interactions for different TP:TN ratios and used to predict in other viral systems.

While it is no surprise that better results are achieved at lower TP:TN ratios (i.e., with fewer TNs per TP), this may also represent an important feature of cross-predicting HP-PPIs. The genetic variation between strains of the same virus can be great, and even more so when comparing different viruses. Thus, trying to cross-predict using a model trained with a high TP:TN ratio (many TNs per TP) in one viral system may over-fit the model to that particular pathogen, whereas lower TP:TN ratios may allow enough variation to capture characteristics general to pathogenesis. Overall, these results suggest that it is promising to build cross-predictive models and that such methods may become more applicable as more systems are studied in depth.

7.3 Conclusions

We have presented the first application, to our knowledge, of supervised machine learning methods to predict human-pathogen PPIs. The goal of these predictive models is to identify physical interactions between human and pathogen proteins that are critical in pathogenesis and to guide cost-effective experimental studies. An important contribution of our work is the comparison of different features and their combinations, and of multiple TP:TN ratios. We applied our methodology to the human-HIV system and found that a model trained using domain profiles, sequence four-mers, and network characteristics of the human proteins provides the most accurate results. When used to predict PPIs involving human proteins known to be critical for HIV infection, the model predicts many interactions supported by the literature. In addition, we demonstrated that it is possible to build a classifier based on data for the human-HIV system and use this classifier to identify known PPIs in other human-viral systems. A key future extension of this work is the integration of additional types of data (e.g., gene expression) once the proper datasets become available.

Chapter 8

Outlooks and Perspectives

The field of systems biology continues to be a rapidly growing area of scientific research. In particular, the study of intra-species Protein Interaction Networks (PINs) continues to receive more attention each year. For example, a simple query of PubMed reveals that the number of articles published on the PIN of the model organism Saccharomyces cerevisiae has grown steadily over the past few years (see Figure 8.1). However, analytical and predictive tools for host-pathogen systems are still in their infancy. The main hindrance remains the scarcity of data, as was the case for model organisms just a few years ago. The advancement of the field of pathosystems biology will require the development of new experimental technologies. For example, a significant hurdle in obtaining parallel gene expression profiles of both host and pathogen from the point of infection remains the ability to acquire enough pathogen mRNA during early infection, the stage that is most important for identifying anti-infection strategies.

The work presented here is the first aimed at the computational study of host-pathogen PINs and represents an important piece of the foundation of the field of computational pathosystems biology. There are still many critical questions that need to be addressed. For example, pharmaceutical companies are facing the daunting challenge of increasing innovation and productivity in their antiviral and antibacterial drug and vaccine pipelines. In particular, since viruses can mutate rapidly, some viruses are becoming resistant to even potent drugs. Recent studies suggest new potential strategies for treating infection by targeting host cellular proteins that are critical for viral infection [223]. Examples of Pathogen-Dependent Host Factors (PDHFs) have recently been identified for HIV [33]. Such host proteins are essential for the pathogenesis of HIV, but are not essential to the host cell. These proteins represent a class of proteins that can then be mined for potential therapeutic strategies. Building on the research presented here, an important extension would be to build classifiers able to predict PDHFs. Identification of such proteins across families of pathogens holds the potential of revealing common PDHFs, which could make it possible to treat multiple infections with a single treatment.

[Figure 8.1: plot of the number of publications (y-axis, 0 to 200) per year (x-axis, 2000 to 2007).]

Figure 8.1: Number of publications in PubMed for studies on the protein interaction network of the model organism Saccharomyces cerevisiae. The PubMed query was “Saccharomyces cerevisiae AND protein interaction”, restricted by the publication year of interest.

Additionally, questions such as “What makes some pathogens pathogenic and others symbiotic?” have yet to be addressed. Much like the comparative network strategies used between intra-species PINs to study complex diseases [45, 146], comparative approaches to Host-Pathogen PINs (HP-PINs) hold the potential of identifying key distinguishing PPIs that affect the phenotypic outcome of disease. These PPIs could then be studied in the context of the HP-PIN to identify subnetworks or modules responsible for pathogenesis.

We have shown here that the field of pathosystems biology, in particular computational pathosystems biology, holds great promise for shedding light on the often poorly characterized process of pathogenesis. As the focus of primary research switches from the view of a single organism to that of complex multi-organism systems, we believe the available resources (e.g., data) will continue to grow. These data, coupled with new algorithms and tools, will fuel the growth of pathosystems biology and provide critical insight into vaccine, drug, and therapeutic development and into what makes a pathogen a pathogen. Such tools and data will also allow us to consider how genetic variation between individuals correlates with phenotypic variation and will become a driving force behind the concept of specialized medicine.

Appendix A

Supplementary Figures

Figure A.1: Log-log scatter-plot of each protein contained within the whole human PPI network used in Chapter 3, which contains 11,463 proteins. The x-axis is the degree and the y-axis is the centrality of a protein within its respective network.

Figure A.2: Log-log scatter-plot of each protein contained within the high-throughput human PPI network used in Chapter 3, which contains 4,986 proteins. The x-axis is the degree and the y-axis is the centrality of a protein within its respective network.

Figure A.3: Log-log scatter-plot of each protein contained within the manually curated human PPI network used in Chapter 3, which contains 8,704 proteins. The x-axis is the degree and the y-axis is the centrality of a protein within its respective network.

[Figure A.4: heat map whose rows and columns are, in order: ALL-ALL-B, ALL-G-B, ALL-M-B, ALL-MP-B, ALL-MPS-B, ALL-MS-B, ALL-P-B, ALL-PS-B, ALL-S-B, G-B, M-B, MP-B, MPS-B, MS-B, P-B, PS-B, S-B.]

Figure A.4: Overlap of the predicted interactions resulting from the use of different filters during the training and testing steps and the application of a post-prediction BLAST-filter (see Chapter 6). ALL is the set of all proteins, G is the set of proteins annotated with Gene Ontology functions of interest, M is the set of proteins containing transmembrane domains, P is the set of proteins containing a signal peptide, and S is the set of proteins predicted to be secreted that do not contain a signal peptide. The first nine groups were trained using all proteins, followed by the use of the filters, whereas the next eight groups were trained and tested on sets of filtered proteins. Each cell is the Jaccard coefficient of the predicted interactions between the corresponding row and column datasets. The Jaccard values range from 0 (red) to 0.5 (black) to 1 (green).
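Each cell in Figure A.4 is a Jaccard coefficient, the size of the intersection of two sets divided by the size of their union. The sketch below computes such a pairwise matrix for toy sets of predicted interactions; the dataset contents are hypothetical, and returning 0 for two empty sets is a convention chosen here rather than one stated in the text.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def jaccard_matrix(datasets):
    """Pairwise Jaccard coefficients for a dict mapping name -> set of items."""
    names = list(datasets)
    return {
        row: {col: jaccard(datasets[row], datasets[col]) for col in names}
        for row in names
    }


# Toy predicted-interaction sets (hypothetical protein pairs).
predictions = {
    "ALL-ALL-B": {("h1", "p1"), ("h2", "p1"), ("h3", "p2")},
    "ALL-M-B": {("h2", "p1"), ("h3", "p2"), ("h4", "p3")},
    "M-B": {("h4", "p3")},
}
matrix = jaccard_matrix(predictions)
```

The matrix is symmetric with ones on the diagonal for non-empty sets, so only one triangle needs to be inspected.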

[Figure A.5: heat map whose rows and columns are, in order: ALL-ALL-B, ALL-G-B, ALL-M-B, ALL-MP-B, ALL-MPS-B, ALL-MS-B, ALL-P-B, ALL-PS-B, ALL-S-B, G-B, M-B, MP-B, MPS-B, MS-B, P-B, PS-B, S-B.]

Figure A.5: Overlap of the enriched functions amongst the predicted interactions resulting from the use of different filters during the training and testing steps and the application of a post-prediction BLAST-filter (see Chapter 6). ALL is the set of all proteins, G is the set of proteins annotated with Gene Ontology functions of interest, M is the set of proteins containing transmembrane domains, P is the set of proteins containing a signal peptide, and S is the set of proteins predicted to be secreted that do not contain a signal peptide. The first nine groups were trained using all proteins, followed by the use of the filters, whereas the next eight groups were trained and tested on sets of filtered proteins. Each cell is the Jaccard coefficient of the predicted interactions between the corresponding row and column datasets. The Jaccard values range from 0 (red) to 0.5 (black) to 1 (green).

Appendix B

Supplementary Tables

Table B.1: Relative occurrences of four types of nodes in each of the three networks used in Chapter 3. The “Fraction” column defines the cutoff at which a protein is considered a hub or a bottleneck. The other columns represent the fraction of hub-bottleneck (H-B), non-hub-bottleneck (NH-B), hub-non-bottleneck (H-NB), and non-hub-non-bottleneck (NH-NB) proteins in the network using that cutoff.

Network  Fraction  H-B    NH-B   H-NB   NH-NB
W        0.10      0.001  0.039  0.000  0.960
         0.20      0.003  0.078  0.001  0.918
         0.30      0.007  0.115  0.002  0.876
         0.40      0.012  0.155  0.010  0.824
HT       0.10      0.002  0.033  0.000  0.965
         0.20      0.004  0.066  0.000  0.930
         0.30      0.006  0.100  0.000  0.894
         0.40      0.009  0.132  0.000  0.859
MC       0.10      0.001  0.043  0.001  0.955
         0.20      0.003  0.086  0.002  0.909
         0.30      0.008  0.127  0.007  0.858
         0.40      0.012  0.173  0.021  0.794

Table B.2: Summary of GSEA degree results with and without human-HIV PPIs from Chapter 3. The “#proteins in group” column displays the total number of human proteins in that group. The “ES” column displays the enrichment score calculated by GSEA. The column titled “#proteins contributing” displays the number of proteins contributing to the ES score. The column titled “Jaccard’s” lists the Jaccard coefficient between the two sets of proteins contributing to the ES score for degree and for centrality (see Table B.3 for centrality results).

All Human-Pathogen PPIs: Degree

Network  Set         #proteins  ES    #proteins     p-value        Jaccard’s
                     in group         contributing
W        Virus       1,029      0.79  563           < 1 × 10^-6    0.51
         Multivirus  182        0.84  129           < 1 × 10^-6    0.60
         Bacteria    108        0.76  65            3 × 10^-5      0.45
HT       Virus       466        0.68  156           < 1 × 10^-6    0.53
         Multivirus  98         0.65  36            0.03           0.68
         Bacteria    43         0.79  13            2 × 10^-3      0.66
MC       Virus       958        0.78  523           < 1 × 10^-6    0.47
         Multivirus  174        0.83  119           < 1 × 10^-6    0.63
         Bacteria    100        0.73  66            3.4 × 10^-4    0.51

Without human-HIV PPIs: Degree

Network  Set         #proteins  ES    #proteins     p-value        Jaccard’s
                     in group         contributing
W        Virus       499        0.80  263           < 1 × 10^-6    0.59
         Multivirus  81         0.83  60            < 1 × 10^-6    0.66
HT       Virus       267        0.70  93            1 × 10^-6      0.57
         Multivirus  46         0.72  21            0.02           0.70
MC       Virus       958        0.78  523           < 1 × 10^-6    0.47
         Multivirus  174        0.83  119           < 1 × 10^-6    0.63
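The “ES” and “#proteins contributing” columns can be read with the help of the following sketch of the commonly used GSEA running-sum statistic: walking down a list of proteins ranked by degree (or centrality), the running sum increases at members of the group and decreases otherwise, the ES is the largest deviation from zero, and the contributing proteins are the group members encountered up to that peak. The exact settings used in Chapter 3 are not shown in this section, so the weighting exponent, function name, and toy data are assumptions.

```python
def enrichment_score(ranked_items, scores, gene_set, weight=1.0):
    """Running-sum enrichment score for a list ranked by decreasing score.

    Returns (ES, leading_edge), where leading_edge holds the set members
    that contribute to the peak of the running sum.
    """
    in_set = [item in gene_set for item in ranked_items]
    hit_total = sum(abs(s) ** weight for s, hit in zip(scores, in_set) if hit)
    miss_count = len(ranked_items) - sum(in_set)
    if hit_total == 0 or miss_count == 0:
        raise ValueError("need both set members and non-members with nonzero scores")
    running, best, best_index = 0.0, 0.0, -1
    for i, (s, hit) in enumerate(zip(scores, in_set)):
        if hit:
            running += abs(s) ** weight / hit_total
        else:
            running -= 1.0 / miss_count
        if abs(running) > abs(best):
            best, best_index = running, i
    if best >= 0:
        # Positive ES: members at or before the peak.
        candidates = zip(ranked_items[:best_index + 1], in_set[:best_index + 1])
    else:
        # Negative ES: members after the trough.
        candidates = zip(ranked_items[best_index + 1:], in_set[best_index + 1:])
    leading_edge = [item for item, hit in candidates if hit]
    return best, leading_edge


# Toy example (hypothetical proteins ranked by degree).
ranked = ["A", "B", "C", "D", "E", "F"]
degrees = [10, 8, 6, 4, 2, 1]
targeted = {"A", "B", "E"}
es, contributing = enrichment_score(ranked, degrees, targeted)
```

In this toy case the running sum peaks after the second protein, so two of the three targeted proteins contribute to the score.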

Table B.3: Summary of GSEA centrality results with and without human-HIV PPIs from Chapter 3. The “#proteins in group” column displays the total number of human proteins in that group. The “ES” column displays the enrichment score calculated by GSEA. The column titled “#proteins contributing” displays the number of proteins contributing to the ES score. The column titled “Jaccard’s” lists the Jaccard coefficient between the two sets of proteins contributing to the ES score for degree and for centrality (see Table B.2 for degree results).

All Human-Pathogen PPIs: Centrality

Network  Set         #proteins  ES    #proteins     p-value        Jaccard’s
                     in group         contributing
W        Virus       1,029      0.83  349           < 1 × 10^-6    0.51
         Multivirus  182        0.86  80            1.2 × 10^-4    0.60
         Bacteria    108        0.89  31            2.3 × 10^-4    0.45
HT       Virus       466        0.82  108           1.5 × 10^-3    0.53
         Multivirus  98         0.82  31            0.10           0.68
         Bacteria    43         0.89  12            0.02           0.66
MC       Virus       958        0.80  308           < 1 × 10^-6    0.47
         Multivirus  174        0.82  85            1.9 × 10^-5    0.63
         Bacteria    100        0.85  37            9.2 × 10^-4    0.51

Without human-HIV PPIs: Centrality

Network  Set         #proteins  ES    #proteins     p-value        Jaccard’s
                     in group         contributing
W        Virus       499        0.85  195           < 1 × 10^-6    0.59
         Multivirus  81         0.88  43            3.6 × 10^-3    0.66
HT       Virus       267        0.84  66            6.12 × 10^-4   0.57
         Multivirus  46         0.86  18            0.07           0.70
MC       Virus       958        0.80  308           < 1 × 10^-6    0.47
         Multivirus  174        0.85  85            1.3 × 10^-3    0.63

Table B.4: Summary of GSEA degree results for individual pathogen groups from Chapter 3. The “#proteins in group” column displays the total number of human proteins in that group. The “ES” column displays the enrichment score calculated by GSEA. The column titled “#proteins contributing” displays the number of proteins contributing to the ES score. The column titled “Jaccard’s” lists the Jaccard coefficient between the two sets of proteins contributing to the ES score for degree and for centrality (see Table B.5 for centrality results).

Degree

Network  Group           #proteins  ES     #proteins     p-value        Jaccard’s
                         in group          contributing
W        Adenovirus      59         0.79   34            2 × 10^-4      0.63
         Chlamydia       19         0.79   18            0.03           0.27
         EBV             121        0.72   61            3.7 × 10^-4    0.57
         Hepatitis       93         0.81   47            < 1 × 10^-6    0.67
         Herpesvirus     54         0.80   32            1.1 × 10^-4    0.70
         HIV             671        0.79   358           < 1 × 10^-6    0.48
         Papillomavirus  94         0.79   66            1 × 10^-6      0.67
         Polyomavirus    8          0.97   6             7 × 10^-5      0.83
         Pseudomonas     9          0.91   7             4.2 × 10^-3    0.88
         Sarcoma virus   35         0.81   31            1.1 × 10^-3    0.65
HT       EBV             64         0.76   17            8.9 × 10^-4    0.83
         Herpesvirus     22         0.84   6             4.6 × 10^-3    0.83
MC       Adenovirus      57         0.79   29            6.4 × 10^-5    0.73
         EBV             110        0.70   56            3.1 × 10^-3    0.57
         Hepatitis       89         0.79   44            < 1 × 10^-6    0.64
         Herpesvirus     50         0.77   31            9.8 × 10^-4    0.72
         HIV             620        0.77   358           < 1 × 10^-6    0.44
         Influenza       76         0.94   67            < 1 × 10^-6    0.47
         Papillomavirus  90         0.766  70            1.5 × 10^-5    0.41
         Polyomavirus    7          0.97   6             6 × 10^-6      0.83
         Pseudomonas     9          0.90   7             5.4 × 10^-3    1
         Sarcoma virus   34         0.79   21            2.1 × 10^-3    0.82

Table B.5: Summary of GSEA centrality results for individual pathogen groups from Chapter 3. The “#proteins in group” column displays the total number of human proteins in that group. The “ES” column displays the enrichment score calculated by GSEA. The column titled “#proteins contributing” displays the number of proteins contributing to the ES score. The column titled “Jaccard’s” lists the Jaccard coefficient between the two sets of proteins contributing to the ES score for degree and for centrality (see Table B.4 for degree results).

Centrality

Network  Group           #proteins  ES    #proteins     p-value        Jaccard’s
                         in group         contributing
W        Adenovirus      59         0.85  23            0.03           0.63
         Chlamydia       19         0.91  5             0.04           0.27
         EBV             121        0.85  41            4.8 × 10^-3    0.57
         Hepatitis       93         0.91  33            5.8 × 10^-5    0.67
         Herpesvirus     54         0.88  24            0.01           0.70
         HIV             671        0.81  218           4 × 10^-6      0.48
         Papillomavirus  94         0.87  46            2.3 × 10^-3    0.67
         Polyomavirus    8          0.96  5             0.03           0.83
         Pseudomonas     9          0.97  8             0.02           0.88
         Sarcoma virus   35         0.88  20            0.04           0.65
HT       EBV             64         0.88  64            0.02           0.83
         Herpesvirus     22         0.95  22            7.8 × 10^-3    0.83
MC       Adenovirus      57         0.86  23            7.6 × 10^-3    0.73
         EBV             110        0.85  38            5.9 × 10^-4    0.57
         Hepatitis       89         0.91  30            8 × 10^-6      0.64
         Herpesvirus     50         0.84  24            0.02           0.72
         HIV             620        0.79  188           < 1 × 10^-6    0.44
         Influenza       76         0.74  42            0.03           0.47
         Papillomavirus  90         0.85  30            2.2 × 10^-3    0.41
         Polyomavirus    7          0.97  5             0.02           0.83
         Pseudomonas     9          0.94  7             0.03           1
         Sarcoma virus   34         0.87  19            0.02           0.82

Table B.6: Summary of GSEA results for protein degree of human proteins for three networks from Chapter 4: (W) the whole human PPI network, (HT) the human PPI network generated by only considering high-throughput experiments, and (MC) the human PPI network generated by only considering manually curated PPIs. The “#proteins in group” column displays the total number of human proteins targeted. The “ES” column displays the enrichment score calculated by GSEA for degree. The column titled “#proteins contributing” displays the number of proteins contributing to the ES score (see Table B.7 for centrality results).

Degree

Network  Group          #proteins  ES    #proteins     p-value
                        in group         contributing
W        B. anthracis   1,269      0.28  834           < 1 × 10^-6
         F. tularensis  729        0.28  574           < 1 × 10^-6
         Y. pestis      1,514      0.28  1,325         < 1 × 10^-6
         Multiple       828        0.31  579           < 1 × 10^-6
         All            241        0.32  187           < 1 × 10^-6
HT       B. anthracis   608        0.39  608           < 1 × 10^-6
         F. tularensis  373        0.38  373           1 × 10^-6
         Y. pestis      723        0.39  723           1 × 10^-6
         Multiple       421        0.39  421           < 1 × 10^-6
         All            127        0.38  127           < 1 × 10^-6
MC       B. anthracis   1,109      0.24  853           < 1 × 10^-6
         F. tularensis  637        0.24  500           < 1 × 10^-6
         Y. pestis      1,331      0.24  1,153         < 1 × 10^-6
         Multiple       733        0.28  596           < 1 × 10^-6
         All            214        0.30  165           < 1 × 10^-6

Table B.7: Summary of GSEA results for protein betweenness centrality of human proteins for three networks from Chapter 4: (W) the whole human PPI network, (HT) the human PPI network generated by only considering high-throughput experiments, and (MC) the human PPI network generated by only considering manually curated PPIs. The “#proteins in group” column displays the total number of human proteins targeted. The “ES” column displays the enrichment score calculated by GSEA for centrality. The column titled “#proteins contributing” displays the number of proteins contributing to the ES score (see Table B.6 for degree results).

Centrality

Network  Group          #proteins  ES    #proteins     p-value
                        in group         contributing
W        B. anthracis   1,269      0.46  1,269         < 1 × 10^-6
         F. tularensis  729        0.45  729           < 1 × 10^-6
         Y. pestis      1,514      0.47  1,514         < 1 × 10^-6
         Multiple       828        0.46  828           < 1 × 10^-6
         All            241        0.45  241           < 1 × 10^-6
HT       B. anthracis   608        0.60  608           < 1 × 10^-6
         F. tularensis  373        0.59  373           < 1 × 10^-6
         Y. pestis      723        0.60  723           2 × 10^-6
         Multiple       421        0.60  421           < 1 × 10^-6
         All            127        0.59  127           2.9 × 10^-5
MC       B. anthracis   1,109      0.41  1,109         < 1 × 10^-6
         F. tularensis  637        0.41  637           < 1 × 10^-6
         Y. pestis      1,331      0.42  1,331         < 1 × 10^-6
         Multiple       733        0.41  733           < 1 × 10^-6
         All            214        0.40  214           < 1 × 10^-6

Table B.8: Accuracy values for each feature set described in Chapter 7. Unless stated otherwise, the feature K corresponds to four-mers.

                           TP:TN ratio
Feature Set      1:1    1:10   1:25   1:50   1:100
D                79.44  94.36  97.10  98.38  99.06
DF               80.60  94.12  96.98  98.28  99.04
DFK              87.88  95.21  97.46  98.53  99.11
DFKN             93.11  96.23  97.80  98.66  99.16
DFN              92.63  96.11  97.30  98.18  98.91
DK               87.93  94.86  97.37  98.53  99.15
DKN              92.68  96.32  97.81  98.65  99.20
DN               92.72  95.74  97.05  98.19  99.00
FK               85.79  92.28  96.40  98.12  99.03
FKN              92.24  95.06  97.07  98.26  99.07
K (two-mers)     75.32  90.89  96.14  98.03  99.01
K (three-mers)   82.06  90.90  96.14  98.03  99.01
K                85.65  92.60  96.38  98.09  99.03
K (five-mers)    76.77  92.20  96.59  98.19  99.08
KN               92.24  94.89  97.04  98.31  99.05
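When reading the accuracy values in Table B.8, it helps to keep in mind how class imbalance affects accuracy. The sketch below computes the accuracy of a trivial classifier that labels every pair as non-interacting, assuming that accuracy is measured on data with the same TP:TN ratio as used in training; that assumption and the function name are made here for illustration only.

```python
def all_negative_accuracy(tn_per_tp):
    """Accuracy (%) of labelling every pair as non-interacting at ratio 1:tn_per_tp."""
    return round(100.0 * tn_per_tp / (1 + tn_per_tp), 2)


# Baseline accuracy for each TP:TN ratio used in Table B.8.
baseline = {"1:%d" % n: all_negative_accuracy(n) for n in (1, 10, 25, 50, 100)}
```

Under this assumption the baseline rises from 50.0% at 1:1 to 99.01% at 1:100, so differences between feature sets are most informative at the ratios with fewer negatives, where the gap to the baseline is widest.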

Table B.9: Accuracy, precision, and recall values for each human-viral system using the DKN model trained with human-HIV PPIs described in Chapter 7.

                             TP:TN ratio
Pathogen Set        1:1    1:10   1:25   1:50   1:100
Influenza virus     97.18  98.40  98.23  98.43  99.06
Hepatitis C virus   91.39  95.36  97.41  98.61  99.21
Herpesvirus         71.47  91.58  95.91  97.74  98.85
Papillomavirus      79.90  93.16  96.23  97.94  98.99

Bibliography

[1] R. al Daccak, K. Mehindate, J. Hebert, L. Rink, S. Mecheri, and W. Mourad. My-

coplasma arthritidis-derived superantigen induces proinflammatory monokine gene ex-pression in the THP-1 human monocytic cell line. Infect Immun, 62(6):2409–2416,1994.

[2] R. Albert, H. Jeong, and A. L. Barabasi. Error and attack tolerance of complexnetworks. Nature, 406:378–382, 2000.

[3] A. Albini, S. Ferrini, R. Benelli, S. Sforzini, D. Giunciuglio, M. G. Aluigi, A. E.Proudfoot, S. Alouani, T. N. Wells, G. Mariani, R. L. Rabin, J. M. Farber, andD. M. Noonan. Hiv-1 Tat protein mimicry of chemokines. Proc Natl Acad Sci U S A,95(22):13153–13158, 1998.

[4] E. Almaas. Biological impacts and context of network theory. J Exp Biol, 210(Pt9):1548–1558, 2007.

[5] P. Aloy, H. Ceulemans, A. Stark, and R. B. Russell. The relationship between sequenceand interaction divergence in proteins. J Mol Biol, 332(5):989–998, 2003.

[6] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, andD. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein databasesearch programs. Nucleic Acids Res, 25(17):3389–402, 1997.

[7] C. Ambrosino, C. Palmieri, A. Puca, F. Trimboli, M. Schiavone, F. Olimpico, M. R.Ruocco, F. di Leva, M. Toriello, I. Quinto, S. Venuta, and G. Scala. Physical and func-tional interaction of HIV-1 Tat with E2F-4, a transcriptional regulator of mammaliancell cycle. J Biol Chem, 277(35):31448–31458, 2002.

[8] J. L. Andersen, J. L. DeHart, E. S. Zimmerman, O. Ardon, B. Kim, G. Jacquot,S. Benichou, and V. Planelles. Hiv-1 Vpr-induced apoptosis is cell cycle dependentand requires Bax but not ANT. PLoS Pathog, 2(12):e127, 2006.

[9] Z. Ao, G. Huang, H. Yao, Z. Xu, M. Labine, A. W. Cochrane, and X. Yao. Interaction ofhuman immunodeficiency virus type 1 integrase with cellular nuclear import receptorimportin 7 and its impact on viral replication. J Biol Chem, 282(18):13456–13467,2007.

131

132

[10] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P.Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver,A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, andG. Sherlock. Gene Ontology: tool for the unification of biology. The Gene Ontologyconsortium. Nat Genet, 25(1):25–9, 2000.

[11] A. Bairoch, R. Apweiler, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. O’Donovan, N. Redaschi, and L. S. Yeh. The Universal Protein Resource (UniProt). Nucleic Acids Res, 33:D154–9, 2005.

[12] N. K. Banda, J. Bernier, D. K. Kurahara, R. Kurrle, N. Haigwood, R. P. Sekaly, and T. H. Finkel. Crosslinking CD4 by human immunodeficiency virus gp120 primes T cells for activation-induced apoptosis. J Exp Med, 176(4):1099–1106, 1992.

[13] S. Bar and M. Alizon. Role of the ectodomain of the gp41 transmembrane envelope protein of human immunodeficiency virus type 1 in late steps of the membrane fusion process. J Virol, 78(2):811–820, 2004.

[14] A. L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.

[15] A. Barsky, J. L. Gardy, R. E. W. Hancock, and T. Munzner. Cerebral: a Cytoscape plugin for layout of and interaction with biological networks using subcellular localization annotation. Bioinformatics, 23(8):1040–1042, 2007.

[16] P. L. Bartel, J. A. Roecklein, D. SenGupta, and S. Fields. A protein linkage map of Escherichia coli bacteriophage T7. Nat Genet, 12(1):72–77, 1996.

[17] S. R. Bartz, M. E. Rogel, and M. Emerman. Human immunodeficiency virus type 1 cell cycle control: Vpr is cytostatic and mediates G2 accumulation by a mechanism which differs from DNA damage checkpoint control. J Virol, 70(4):2324–2331, 1996.

[18] D. I. Baruch, J. A. Gormely, C. Ma, R. J. Howard, and B. L. Pasloske. Plasmodium falciparum erythrocyte membrane 1 is a parasitized erythrocyte receptor for adherence to CD36, thrombospondin, and intracellular adhesion molecule 1. Proc Natl Acad Sci U S A, 93(8):3497–3502, 1996.

[19] N. N. Batada, L. D. Hurst, and M. Tyers. Evolutionary and physiological importance of hub proteins. PLoS Comput Biol, 2(7):e88, 2006.

[20] J. Baum, A. G. Maier, R. T. Good, K. M. Simpson, and A. F. Cowman. Invasion by P. falciparum merozoites suggests a hierarchy of molecular interactions. PLoS Pathog, 1(4):e37, 2005.

[21] J. D. Bendtsen, L. Kiemer, A. Fausboll, and S. Brunak. Non-classical protein secretion in bacteria. BMC Microbiol, 5:58, 2005.

[22] J. D. Bendtsen, H. Nielsen, G. von Heijne, and S. Brunak. Improved prediction of signal peptides: SignalP 3.0. J Mol Biol, 340(4):783–795, 2004.

[23] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate - a practical and powerful approach to multiple testing. J Roy Stat Soc B Met, 57(1):289–300, 1995.

[24] A. Bernat, N. Avvakumov, J. S. Mymryk, and L. Banks. Interaction between the HPV E7 oncoprotein and the transcriptional coactivator p300. Oncogene, 22(39):7871–7881, 2003.

[25] P. D. Bieniasz, T. A. Grdina, H. P. Bogerd, and B. R. Cullen. Recruitment of cyclin T1/P-TEFb to an HIV type 1 long terminal repeat promoter proximal RNA target is both necessary and sufficient for full activation of transcription. Proc Natl Acad Sci U S A, 96(14):7791–7796, 1999.

[26] C. Birago, V. Albanesi, F. Silvestrini, L. Picci, E. Pizzi, P. Alano, T. Pace, and M. Ponzi. A gene-family encoding small exported proteins is conserved across Plasmodium genus. Mol Biochem Parasitol, 126(2):209–218, 2003.

[27] A. Blindenbacher, F. H. T. Duong, L. Hunziker, S. T. D. Stutvoet, X. Wang, L. Terracciano, D. Moradpour, H. E. Blum, T. Alonzi, M. Tripodi, N. La Monica, and M. H. Heim. Expression of hepatitis C virus proteins inhibits interferon alpha signaling in the liver of transgenic mice. Gastroenterology, 124(5):1465–1475, 2003.

[28] L. Borio, D. Frank, V. Mani, C. Chiriboga, M. Pollanen, M. Ripple, S. Ali, C. DiAngelo, J. Lee, J. Arden, J. Titus, D. Fowler, T. O’Toole, H. Masur, J. Bartlett, and T. Inglesby. Death due to bioterrorism-related inhalational anthrax: report of 2 patients. JAMA, 286(20):2554–2559, 2001.

[29] C. M. Bosio, H. Bielefeldt-Ohmann, and J. T. Belisle. Active suppression of the pulmonary immune response by Francisella tularensis Schu4. J Immunol, 178(7):4538–4547, 2007.

[30] P. M. Bowers, M. Pellegrini, M. J. Thompson, J. Fierro, T. O. Yeates, and D. Eisenberg. Prolinks: a database of protein functional linkages derived from coevolution. Genome Biol, 5(5):R35, 2004.

[31] Z. Bozdech, M. Llinas, B. L. Pulliam, E. D. Wong, J. Zhu, and J. L. DeRisi. The transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum. PLoS Biol, 1(1), 2003.

[32] U. Brandes. A faster algorithm for betweenness centrality. Mathematical Sociology, 25(2):163–177, 2001.

[33] A. L. Brass, D. M. Dykxhoorn, Y. Benita, N. Yan, A. Engelman, R. J. Xavier, J. Lieberman, and S. J. Elledge. Identification of host proteins required for HIV infection through a functional genomic screen. Science, 319(5865):921–926, 2008.

[34] A. Brehm, S. J. Nielsen, E. A. Miska, D. J. McCance, J. L. Reid, A. J. Bannister, and T. Kouzarides. The E7 oncoprotein associates with Mi2 and histone deacetylase activity to promote cell growth. EMBO J, 18(9):2449–2458, 1999.

[35] L. Burger and E. van Nimwegen. Accurate prediction of protein-protein interactions from sequence alignments using a Bayesian method. Mol Syst Biol, 4:165, 2008.

[36] M. A. Calderwood, K. Venkatesan, L. Xing, M. R. Chase, A. Vazquez, A. M. Holthaus, A. E. Ewence, N. Li, T. Hirozane-Kishikawa, D. E. Hill, M. Vidal, E. Kieff, and E. Johannsen. Epstein-Barr virus and virus human protein interaction maps. Proc Natl Acad Sci U S A, 104(18):7606–7611, 2007.

[37] G. R. Campbell, E. Pasquier, J. Watkins, V. Bourgarel-Rey, V. Peyrot, D. Esquieu, P. Barbier, J. de Mareuil, D. Braguer, P. Kaleebu, D. L. Yirrell, and E. P. Loret. The glutamine-rich region of the HIV-1 Tat protein is involved in T-cell apoptosis. J Biol Chem, 279(46):48197–48204, 2004.

[38] C. Caretta-Cartozo, P. De Los Rios, F. Piazza, and P. Lio. Bottleneck genes and community structure in the cell cycle network of S. pombe. PLoS Comput Biol, 3(6):e103, 2007.

[39] E. Carrillo, E. Garrido, and P. Gariglio. Specific in vitro interaction between papillomavirus E2 proteins and TBP-associated factors. Intervirology, 47(6):342–349, 2004.

[40] M. Carrolo, S. Giordano, L. Cabrita-Santos, S. Corso, A. M. Vigario, S. Silva, P. Leiriao, D. Carapau, R. Armas-Portela, P. M. Comoglio, A. Rodriguez, and M. M. Mota. Hepatocyte growth factor and its receptor are required for malaria infection. Nat Med, 9:1363–1369, 2003.

[41] T. Cathomen and M. D. Weitzman. A functional complex of adenovirus proteins E1B-55kDa and E4orf6 is necessary to modulate the expression level of p53 but not its transcriptional activity. J Virol, 74(23):11407–11412, 2000.

[42] E. Y. Chan, W.-J. Qian, D. L. Diamond, T. Liu, M. A. Gritsenko, M. E. Monroe, D. G. n. Camp, R. D. Smith, and M. G. Katze. Quantitative analysis of human immunodeficiency virus type 1-infected CD4+ cell proteome: dysregulated cell cycle progression and nuclear transport coincide with robust virus production. J Virol, 81(14):7571–7583, 2007.

[43] B. K. Chen, R. T. Gandhi, and D. Baltimore. CD4 down-modulation during infection of human T cells with human immunodeficiency virus type 1 involves independent activities of vpu, env, and nef. J Virol, 70(9):6044–6053, 1996.

[44] C. Chiao, T. Bader, J. E. Stenger, W. Baldwin, J. Brady, and J. C. Barrett. HIV type 1 Tat inhibits tumor necrosis factor alpha-induced repression of tumor necrosis factor receptor p55 and amplifies tumor necrosis factor alpha activity in stably Tat-transfected HeLa cells. AIDS Res Hum Retroviruses, 17(12):1125–1132, 2001.

[45] H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, and T. Ideker. Network-based classification of breast cancer metastasis. Mol Syst Biol, 3:140, 2007.

[46] P. Chugh, S. Fan, V. Planelles, S. B. Maggirwar, S. Dewhurst, and B. Kim. Infection of human immunodeficiency virus and intracellular viral Tat protein exert a pro-survival effect in a human microglial cell line. J Mol Biol, 366(1):67–81, 2007.

[47] K. M. Chung, J. Lee, J. E. Kim, O. K. Song, S. Cho, J. Lim, M. Seedorf, B. Hahm, and S. K. Jang. Nonstructural protein 5A of hepatitis C virus inhibits the function of karyopherin beta3. J Virol, 74(11):5233–5241, 2000.

[48] A. Cossarizza. Apoptosis and HIV infection: about molecules and genes. Curr Pharm Des, 14(3):237–244, 2008.

[49] S. Coulomb, M. Bauer, D. Bernard, and M.-C. Marsolier-Kergoat. Gene essentiality and the topology of protein interaction networks. Proc Biol Sci, 272(1573):1721–1725, 2005.

[50] A. F. Cowman and B. S. Crabb. Invasion of red blood cells by malaria parasites. Cell, 124:755–766, 2006.

[51] P. Cristofaro and S. M. Opal. Role of Toll-like receptors in infection and immunity: clinical implications. Drugs, 66(1):15–29, 2006.

[52] C. U. P. Da Costa, N. Wantia, C. J. Kirschning, D. H. Busch, N. Rodriguez, H. Wagner, and T. Miethke. Heat shock protein 60 from Chlamydia pneumoniae elicits an unusual set of inflammatory responses via Toll-like receptor 2 and 4 in vivo. Eur J Immunol, 34(10):2874–2884, 2004.

[53] T. Dandekar, B. Snel, M. Huynen, and P. Bork. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci, 23(9):324–328, 1998.

[54] M. S. Darshan, J. Lucchi, E. Harding, and J. Moroianu. The L2 minor capsid protein of human papillomavirus type 16 interacts with a network of nuclear import receptors. J Virol, 78(22):12179–12188, 2004.

[55] F. P. Davis, D. T. Barkan, N. Eswar, J. H. McKerrow, and A. Sali. Host pathogen protein interactions predicted by comparative modeling. Protein Sci, 16(12):2585–2596, 2007.

[56] A. De Luca, R. Mangiacasale, A. Severino, L. Malquori, A. Baldi, A. Palena, A. M. Mileo, P. Lavia, and M. G. Paggi. E1A deregulates the centrosome cycle in a Ran GTPase-dependent manner. Cancer Res, 63(6):1430–1437, 2003.

[57] M. Deng, S. Mehta, F. Sun, and T. Chen. Inferring domain-domain interactions from protein-protein interactions. Genome Res, 12(10):1540–1548, 2002.

[58] T. Dobner, N. Horikoshi, S. Rubenwolf, and T. Shenk. Blockage by adenovirus E4orf6 of transcriptional activation by the p53 tumor suppressor. Science, 272(5267):1470–1473, 1996.

[59] J. C. Dorsman, A. F. Teunisse, A. Zantema, and A. J. van der Eb. The adenovirus 12 E1A proteins can bind directly to proteins of the p300 transcription co-activator family, including the CREB-binding protein CBP and p300. J Gen Virol, 78(Pt 2):423–426, 1997.

[60] R. Dunn, F. Dudbridge, and C. M. Sanderson. The use of edge-betweenness clustering to investigate biological function in protein interaction networks. BMC Bioinformatics, 6:39, 2005.

[61] S. Dupuis, E. Jouanguy, S. Al-Hajjar, C. Fieschi, I. Z. Al-Mohsen, S. Al-Jumaah, K. Yang, A. Chapgier, C. Eidenschenk, P. Eid, A. Al Ghonaium, H. Tufenkeji, H. Frayha, S. Al-Gazlan, H. Al-Rayes, R. D. Schreiber, I. Gresser, and J.-L. Casanova. Impaired response to interferon-alpha/beta and lethal viral disease in human STAT1 deficiency. Nat Genet, 33(3):388–391, 2003.

[62] J. Dutkowski and J. Tiuryn. Identification of functional modules from conserved ancestral protein-protein interactions. Bioinformatics, 23(13):i149–58, 2007.

[63] M. D. Dyer, T. M. Murali, and B. W. Sobral. Computational prediction of host-pathogen protein-protein interactions. Bioinformatics, 23(13):i159–i166, 2007.

[64] M. D. Dyer, T. M. Murali, and B. W. Sobral. The landscape of human proteins interacting with viruses and other pathogens. PLoS Pathog, 4(2):e32, 2008.

[65] R. Edgar, M. Domrachev, and A. E. Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res, 30(1):207–210, 2002.

[66] A. Efthymiadis, L. J. Briggs, and D. A. Jans. The HIV-1 Tat nuclear localization sequence confers novel nuclear import properties. J Biol Chem, 273(3):1623–1628, 1998.

[67] J. Ellis, P. C. F. Oyston, M. Green, and R. W. Titball. Tularemia. Clin Microbiol Rev, 15(4):631–646, 2002.

[68] R. M. Ewing, P. Chu, F. Elisma, H. Li, P. Taylor, S. Climie, L. McBroom-Cerajewski, M. D. Robinson, L. O’Connor, M. Li, R. Taylor, M. Dharsee, Y. Ho, A. Heilbut, L. Moore, S. Zhang, O. Ornatsky, Y. V. Bukhman, M. Ethier, Y. Sheng, J. Vasilescu, M. Abu-Farha, J.-P. Lambert, H. S. Duewel, I. I. Stewart, B. Kuehl, K. Hogue, K. Colwill, K. Gladwish, B. Muskat, R. Kinach, S.-L. Adams, M. F. Moran, G. B. Morin, T. Topaloglou, and D. Figeys. Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol Syst Biol, 3:89, 2007.

[69] M. Filippova, L. Parkhurst, and P. J. Duerksen-Hughes. The human papillomavirus 16 E6 protein binds to Fas-associated death domain and protects cells from Fas-triggered apoptosis. J Biol Chem, 279(24):25729–25744, 2004.

[70] J. Flannick, A. Novak, B. S. Srinivasan, H. H. McAdams, and S. Batzoglou. Graemlin: general and robust alignment of multiple large interaction networks. Genome Res, 16(9):1169–1181, 2006.

[71] E. A. Fortunato, M. H. Sommer, K. Yoder, and D. H. Spector. Identification of domains within the human cytomegalovirus major immediate-early 86-kilodalton protein and the retinoblastoma protein required for physical and functional interaction with each other. J Virol, 71(11):8176–8185, 1997.

[72] M. Foti, A. Mangasarian, V. Piguet, D. P. Lew, K. H. Krause, D. Trono, and J. L. Carpentier. Nef-mediated clathrin-coated pit formation. J Cell Biol, 139(1):37–47, 1997.

[73] H. B. Fraser, A. E. Hirsh, L. M. Steinmetz, C. Scharfe, and M. W. Feldman. Evolutionary rate in the protein interaction network. Science, 296(5568):750–752, 2002.

[74] L. C. Freeman. Set of measures of centrality based on betweenness. Sociometry, 40:35–41, 1977.

[75] P. Gallay, T. Hope, D. Chin, and D. Trono. HIV-1 infection of nondividing cells through the recognition of integrase by the importin/karyopherin pathway. Proc Natl Acad Sci U S A, 94(18):9825–9830, 1997.

[76] A.-C. Gavin, P. Aloy, P. Grandi, R. Krause, M. Boesche, M. Marzioch, C. Rau, L. J. Jensen, S. Bastuck, B. Dumpelfeld, A. Edelmann, M.-A. Heurtier, V. Hoffman, C. Hoefert, K. Klein, M. Hudak, A.-M. Michon, M. Schelder, M. Schirle, M. Remor, T. Rudi, S. Hooper, A. Bauer, T. Bouwmeester, G. Casari, G. Drewes, G. Neubauer, J. M. Rick, B. Kuster, P. Bork, R. B. Russell, and G. Superti-Furga. Proteome survey reveals modularity of the yeast cell machinery. Nature, 440(7084):631–636, 2006.

[77] A.-C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J. M. Rick, A.-M. Michon, C.-M. Cruciat, M. Remor, C. Hofert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M.-A. Heurtier, R. R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, and G. Superti-Furga. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415(6868):141–147, 2002.

[78] J. V. Geisberg, J. L. Chen, and R. P. Ricciardi. Subregions of the adenovirus E1A transactivation domain target multiple components of the TFIID complex. Mol Cell Biol, 15(11):6283–6290, 1995.

[79] J. V. Geisberg, W. S. Lee, A. J. Berk, and R. P. Ricciardi. The zinc finger region of the adenovirus E1A transactivating domain complexes with the TATA box binding protein. Proc Natl Acad Sci U S A, 91(7):2488–2492, 1994.

[80] D. Gilbert. Biomolecular Interaction Network Database. Brief Bioinform, 6(2):194–8, 2005.

[81] L. Giot, J. S. Bader, C. Brouwer, A. Chaudhuri, B. Kuang, Y. Li, Y. L. Hao, C. E. Ooi, B. Godwin, E. Vitols, G. Vijayadamodar, P. Pochart, H. Machineni, M. Welsh, Y. Kong, B. Zerhusen, R. Malcolm, Z. Varrone, A. Collis, M. Minto, S. Burgess, L. McDaniel, E. Stimpson, F. Spriggs, J. Williams, K. Neurath, N. Ioime, M. Agee, E. Voss, K. Furtak, R. Renzulli, N. Aanensen, S. Carrolla, E. Bickelhaupt, Y. Lazovatsky, A. DaSilva, J. Zhong, C. A. Stanyon, R. L. J. Finley, K. P. White, M. Braverman, T. Jarvie, S. Gold, M. Leach, J. Knight, R. A. Shimkets, M. P. McKenna, J. Chant, and J. M. Rothberg. A protein interaction map of Drosophila melanogaster. Science, 302(5651):1727–1736, 2003.

[82] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proc Natl Acad Sci U S A, 99(12):7821–7826, 2002.

[83] F. Grande, A. Garofalo, and N. Neamati. Small molecules anti-HIV therapeutics targeting CXCR4. Curr Pharm Des, 14(4):385–404, 2008.

[84] A. Grigoriev. A relationship between gene expression and protein interaction on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res, 29(17):3513–3519, 2001.

[85] S. Grossman, S. Baur, P. N. Robinson, and M. Vingron. An improved statistic for detecting over-represented gene ontology annotations in gene sets. In A. Apostolico, editor, RECOMB 2006, pages 85–98. Springer-Verlag Berlin, Heidelberg, 2006.

[86] U. Guldener, M. Munsterkotter, M. Oesterheld, P. Pagel, A. Ruepp, H. W. Mewes, and V. Stumpflen. MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Res, 34:D436–41, 2006.

[87] K. C. Gunsalus, H. Ge, A. J. Schetter, D. S. Goldberg, J.-D. J. Han, T. Hao, G. F. Berriz, N. Bertin, J. Huang, L.-S. Chuang, N. Li, R. Mani, A. A. Hyman, B. Sonnichsen, C. J. Echeverri, F. P. Roth, M. Vidal, and F. Piano. Predictive models of molecular machines involved in Caenorhabditis elegans early embryogenesis. Nature, 436(7052):861–865, 2005.

[88] R. Haase, K. Richter, G. Pfaffinger, G. Courtois, and K. Ruckdeschel. Yersinia outer protein P suppresses TGF-beta-activated kinase-1 activity to impair innate immune signaling in Yersinia enterocolitica-infected cells. J Immunol, 175(12):8209–8217, 2005.

[89] M. W. Hahn and A. D. Kern. Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Mol Biol Evol, 22(4):803–806, 2005.

[90] J.-D. J. Han, N. Bertin, T. Hao, D. S. Goldberg, G. F. Berriz, L. V. Zhang, D. Dupuy, A. J. M. Walhout, M. E. Cusick, F. P. Roth, and M. Vidal. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature, 430(6995):88–93, 2004.

[91] J.-D. J. Han, D. Dupuy, N. Bertin, M. E. Cusick, and M. Vidal. Effect of sampling on topology predictions of protein-protein interaction networks. Nat Biotechnol, 23(7):839–844, 2005.

[92] D. Hanahan and R. A. Weinberg. The hallmarks of cancer. Cell, 100(1):57–70, 2000.

[93] F. Hayashi, K. D. Smith, A. Ozinsky, T. R. Hawn, E. C. Yi, D. R. Goodlett, J. K. Eng, S. Akira, D. M. Underhill, and A. Aderem. The innate immune response to bacterial flagellin is mediated by Toll-like receptor 5. Nature, 410(6832):1099–1103, 2001.

[94] X. He and J. Zhang. Why do hubs tend to be essential in protein networks? PLoS Genet, 2(6):e88, 2006.

[95] B. R. Henderson and P. Percipalle. Interactions between HIV Rev and nuclear import and export factors: the Rev nuclear localisation signal mediates specific binding to human importin-beta. J Mol Biol, 274(5):693–707, 1997.

[96] H. Hermjakob, L. Montecchi-Palazzi, C. Lewington, S. Mudali, S. Kerrien, S. Orchard, M. Vingron, B. Roechert, P. Roepstorff, A. Valencia, H. Margalit, J. Armstrong, A. Bairoch, G. Cesareni, D. Sherman, and R. Apweiler. IntAct: an open source molecular interaction database. Nucleic Acids Res, 32:D452–5, 2004.

[97] N. Hiller, S. Bhattacharjee, C. V. Ooij, K. Liolios, T. Harrison, C. Lopez-Estrano, and K. Haldar. A host-targeting signal in virulence proteins reveals a secretome in malarial infection. Science, 306:1934–1937, 2004.

[98] A. Hintze and C. Adami. Evolution of complex modular biological networks. PLoS Comput Biol, 4(2):e23, 2008.

[99] Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S.-L. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A. R. Willems, H. Sassi, P. A. Nielsen, K. J. Rasmussen, J. R. Andersen, L. E. Johansen, L. H. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B. D. Sorensen, J. Matthiesen, R. C. Hendrickson, F. Gleeson, T. Pawson, M. F. Moran, D. Durocher, M. Mann, C. W. V. Hogue, D. Figeys, and M. Tyers. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415(6868):180–183, 2002.

[100] W. H. Hoffman, S. Biade, J. T. Zilfou, J. Chen, and M. Murphy. Transcriptional repression of the anti-apoptotic survivin gene by wild type p53. J Biol Chem, 277(5):3247–3257, 2002.

[101] D. C. Hoyle, M. Rattray, R. Jupp, and A. Brass. Making sense of microarray data distributions. Bioinformatics, 18(4):576–584, 2002.

[102] L. Huang, I. Bosch, W. Hofmann, J. Sodroski, and A. B. Pardee. Tat protein induces human immunodeficiency virus type 1 (HIV-1) coreceptors and promotes infection with both macrophage-tropic and T-lymphotropic HIV-1 strains. J Virol, 72(11):8952–8960, 1998.

[103] S. Huang, W. H. Lee, and E. Y. Lee. A cellular protein that competes with SV40 T antigen for binding to the retinoblastoma gene product. Nature, 350(6314):160–162, 1991.

[104] T. Huang, C. Lin, and C. Kao. Reconstruction of human protein interolog network using evolutionary conserved network. BMC Bioinformatics, 8(1):152, 2007.

[105] X. Huang, U. Seifert, U. Salzmann, P. Henklein, R. Preissner, W. Henke, A. J. Sijts, P. M. Kloetzel, and W. Dubiel. The RTP site shared by the HIV-1 Tat protein and the 11S regulator subunit alpha is crucial for their effects on proteasome function including antigen processing. J Mol Biol, 323(4):771–782, 2002.

[106] K.-W. Huh, J. DeMasi, H. Ogawa, Y. Nakatani, P. M. Howley, and K. Munger. Association of the human papillomavirus type 16 E7 oncoprotein with the 600-kDa retinoblastoma protein-associated factor, p600. Proc Natl Acad Sci U S A, 102(32):11492–11497, 2005.

[107] T. Ideker and R. Sharan. Protein networks in disease. Genome Res, 18(4):644–652, 2008.

[108] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A, 98(8):4569–4574, 2001.

[109] T. Ito, K. Tashiro, S. Muta, R. Ozawa, T. Chiba, M. Nishizawa, K. Yamamoto, S. Kuhara, and Y. Sakaki. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc Natl Acad Sci U S A, 97(3):1143–1147, 2000.

[110] E. Izmailova, F. M. N. Bertley, Q. Huang, N. Makori, C. J. Miller, R. A. Young, and A. Aldovini. HIV-1 Tat reprograms immature dendritic cells to express chemoattractants for activated T cells and macrophages. Nat Med, 9(2):191–197, 2003.

[111] K. L. Jang, M. K. Collins, and D. S. Latchman. The human immunodeficiency virus tat protein increases the transcription of human Alu repeated sequences by increasing the activity of the cellular transcription factor TFIIIC. J Acquir Immune Defic Syndr, 5(11):1142–1147, 1992.

[112] R. Jansen, D. Greenbaum, and M. Gerstein. Relating whole-genome expression data with protein-protein interactions. Genome Res, 12:37–46, 2002.

[113] R. Jansen, H. Yu, D. Greenbaum, Y. Kluger, N. Krogan, S. Chung, A. Emili, M. Snyder, J. Greenblatt, and M. Gerstein. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302:449–453, 2003.

[114] H. Jeong, S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41–42, 2001.

[115] Z. Jiao, W. Wang, R. Jia, J. Li, H. You, L. Chen, and Y. Wang. Accumulation of FOXP3-expressing CD4+CD25+ T cells with distinct chemokine receptors in synovial fluid of patients with active rheumatoid arthritis. Scand J Rheumatol, 36(6):428–433, 2007.

[116] T. Joachims. Making large-scale SVM learning practical. Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.

[117] G. Joshi-Tope, M. Gillespie, I. Vastrik, P. D’Eustachio, E. Schmidt, B. de Bono, B. Jassal, G. R. Gopinath, G. R. Wu, L. Matthews, S. Lewis, E. Birney, and L. Stein. REACTOME: a knowledgebase of biological pathways. Nucleic Acids Res, 33:D428–32, 2005.

[118] M. P. Joy, A. Brock, D. E. Ingber, and S. Huang. High-betweenness proteins in the yeast protein interaction network. J Biomed Biotechnol, 2005(2):96–103, 2005.

[119] R. Kafri, O. Dahan, J. Levy, and Y. Pilpel. Preferential protection of protein interaction network hubs in yeast: evolved functionality of genetic redundancy. Proc Natl Acad Sci U S A, 105(4):1243–1248, 2008.

[120] U. Karaoz, T. M. Murali, S. Letovsky, Y. Zheng, C. Ding, C. R. Cantor, and S. Kasif. Whole-genome annotation by using evidence integration in functional-linkage networks. Proc Natl Acad Sci U S A, 101(9):2888–2893, 2004.

[121] A. Karimpour-Fard, C. S. Detweiler, K. D. Erickson, L. Hunter, and R. T. Gill. Cross-species cluster co-conservation: a new method for generating protein interaction networks. Genome Biol, 8(9):R185, 2007.

[122] F. Kashanchi, G. Piras, M. F. Radonovich, J. F. Duvall, A. Fattaey, C. M. Chiang, R. G. Roeder, and J. N. Brady. Direct interaction of human TFIID with the HIV-1 transactivator Tat. Nature, 367(6460):295–299, 1994.

[123] C. W. Kauth, U. Woehlbier, M. Kern, Z. Mekonnen, R. Lutz, N. Mucke, J. Langowski, and H. Bujard. Interactions between merozoite surface proteins 1, 6, and 7 of the malaria parasite Plasmodium falciparum. J Biol Chem, 281(42):31517–31527, 2006.

[124] B. P. Kelley, R. Sharan, R. M. Karp, T. Sittler, D. E. Root, B. R. Stockwell, and T. Ideker. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc Natl Acad Sci U S A, 100(20):11394–11399, 2003.

[125] W. K. Kim, J. Park, and J. K. Suh. Large scale statistical prediction of protein-protein interaction by potentially interacting domain (PID) pair. Genome Inform Ser Workshop Genome Inform, 13:42–50, 2002.

[126] T. Kiyono, A. Hiraiwa, M. Fujita, Y. Hayashi, T. Akiyama, and M. Ishibashi. Binding of high-risk human papillomavirus E6 oncoproteins to the human homologue of the Drosophila discs large tumor suppressor protein. Proc Natl Acad Sci U S A, 94(21):11612–11616, 1997.

[127] M. Koziczak-Holbro, C. Joyce, A. Gluck, B. Kinzel, M. Muller, C. Tschopp, J. C. Mathison, C. N. Davis, and H. Gram. IRAK-4 Kinase Activity Is Required for Interleukin-1 (IL-1) Receptor- and Toll-like Receptor 7-mediated Signaling and Gene Expression. J Biol Chem, 282(18):13552–13560, 2007.

[128] N. J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, J. Li, S. Pu, N. Datta, A. P. Tikuisis, T. Punna, J. M. Peregrin-Alvarez, M. Shales, X. Zhang, M. Davey, M. D. Robinson, A. Paccanaro, J. E. Bray, A. Sheung, B. Beattie, D. P. Richards, V. Canadien, A. Lalev, F. Mena, P. Wong, A. Starostine, M. M. Canete, J. Vlasblom, S. Wu, C. Orsi, S. R. Collins, S. Chandran, R. Haw, J. J. Rilstone, K. Gandi, N. J. Thompson, G. Musso, P. St Onge, S. Ghanny, M. H. Y. Lam, G. Butland, A. M. Altaf-Ul, S. Kanaya, A. Shilatifard, E. O’Shea, J. S. Weissman, C. J. Ingles, T. R. Hughes, J. Parkinson, M. Gerstein, S. J. Wodak, A. Emili, and J. F. Greenblatt. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature, 440(7084):637–643, 2006.

[129] A. Krogh, B. Larsson, G. von Heijne, and E. L. Sonnhammer. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol, 305(3):567–580, 2001.

[130] M. Kundu, S. Sharma, A. De Luca, A. Giordano, J. Rappaport, K. Khalili, and S. Amini. HIV-1 Tat elongates the G1 phase and indirectly promotes HIV-1 gene expression in cells of glial origin. J Biol Chem, 273(14):8130–8136, 1998.

[131] D. J. LaCount, M. Vignali, R. Chettier, A. Phansalkar, R. Bell, J. R. Hesselberth, L. W. Schoenfeld, I. Ota, S. Sahasrabudhe, C. Kurschner, S. Fields, and R. E. Hughes. A protein interaction network of the malaria parasite Plasmodium falciparum. Nature, 438(7064):103–107, 2005.

[132] X. H. Lai, I. Golovliov, and A. Sjostedt. Francisella tularensis induces cytopathogenicity and apoptosis in murine macrophages via a mechanism that requires intracellular bacterial multiplication. Infect Immun, 69(7):4691–4694, 2001.

[133] S. E. Lang and P. Hearing. The adenovirus E1A oncoprotein recruits the cellular TRRAP/GCN5 histone acetyltransferase complex. Oncogene, 22(18):2836–2841, 2003.

[134] K. G. Le Roch, Y. Zhou, P. L. Blair, M. Grainger, J. K. Moch, J. D. Haynes, P. De La Vega, A. A. Holder, D. J. Carucci, and E. A. Winzeler. Discovery of gene function by expression profiling of the malaria parasite life cycle. Science, 301(5639):1503–1508, 2003.

[135] E. Le Rouzic, A. Mousnier, C. Rustum, F. Stutz, E. Hallberg, C. Dargemont, and S. Benichou. Docking of HIV-1 Vpr to the nuclear envelope is mediated by the interaction with the nucleoporin hCG1. J Biol Chem, 277(47):45091–45098, 2002.

[136] M. S. Lechner and L. A. Laimins. Inhibition of p53 DNA binding by human papillomavirus E6 proteins. J Virol, 68(7):4262–4273, 1994.

[137] S. S. Lee, R. S. Weiss, and R. T. Javier. Binding of human virus oncoproteins to hDlg/SAP97, a mammalian homolog of the Drosophila discs large tumor suppressor protein. Proc Natl Acad Sci U S A, 94(13):6670–6675, 1997.

[138] V. T. Lee and O. Schneewind. Protein secretion and the pathogenesis of bacterial infections. Genes Dev, 15(14):1725–1752, 2001.

[139] C. J. Li, D. J. Friedman, C. Wang, V. Metelev, and A. B. Pardee. Induction of apoptosis in uninfected lymphocytes by HIV-1 Tat protein. Science, 268(5209):429–431, 1995.

[140] D. Li, J. Li, S. Ouyang, J. Wang, S. Wu, P. Wan, Y. Zhu, X. Xu, and F. He. Protein interaction networks of Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster: large-scale organization and robustness. Proteomics, 6(2):456–461, 2006.

[141] L. Li, C. J. J. Stoeckert, and D. S. Roos. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res, 13(9):2178–2189, 2003.

[142] S. Li, C. M. Armstrong, N. Bertin, H. Ge, S. Milstein, M. Boxem, P.-O. Vidalain, J.-D. J. Han, A. Chesneau, T. Hao, D. S. Goldberg, N. Li, M. Martinez, J.-F. Rual, P. Lamesch, L. Xu, M. Tewari, S. L. Wong, L. V. Zhang, G. F. Berriz, L. Jacotot, P. Vaglio, J. Reboul, T. Hirozane-Kishikawa, Q. Li, H. W. Gabel, A. Elewa, B. Baumgartner, D. J. Rose, H. Yu, S. Bosak, R. Sequerra, A. Fraser, S. E. Mango, W. M. Saxton, S. Strome, S. Van Den Heuvel, F. Piano, J. Vandenhaute, C. Sardet, M. Gerstein, L. Doucette-Stamm, K. C. Gunsalus, J. W. Harper, M. E. Cusick, F. P. Roth, D. E. Hill, and M. Vidal. A map of the interactome network of the metazoan C. elegans. Science, 303(5657):540–543, 2004.

[143] B. L. Ligon. Plague: a review of its history and potential as a biological weapon. Semin Pediatr Infect Dis, 17(3):161–170, 2006.

[144] W. Lin, S. S. Kim, E. Yeung, Y. Kamegaya, J. T. Blackard, K. A. Kim, M. J. Holtzman, and R. T. Chung. Hepatitis C virus core protein blocks interferon signaling by interaction with the STAT1 SH2 domain. J Virol, 80(18):9226–9235, 2006.

[145] F. Liu and M. R. Green. Promoter targeting by adenovirus E1a through interaction with different cellular DNA-binding domains. Nature, 368(6471):520–525, 1994.

[146] M. Liu, A. Liberzon, S. W. Kong, W. R. Lai, P. J. Park, I. S. Kohane, and S. Kasif. Network-based analysis of affected biological processes in type 2 diabetes models. PLoS Genet, 3(6):e96, 2007.

[147] X. Liu, A. Clements, K. Zhao, and R. Marmorstein. Structure of the human Papillomavirus E7 oncoprotein and its mechanism for inactivation of the retinoblastoma tumor suppressor. J Biol Chem, 281(1):578–586, 2006.

[148] F. Longo, M. A. Marchetti, L. Castagnoli, P. A. Battaglia, and F. Gigliani. A novel approach to protein-protein interaction: complex formation between the p53 tumor suppressor and the HIV Tat proteins. Biochem Biophys Res Commun, 206(1):326–334, 1995.

[149] D. C. Look, W. T. Roswit, A. G. Frick, Y. Gris-Alevy, D. M. Dickhaus, M. J. Walter, and M. J. Holtzman. Direct suppression of Stat1 function during adenoviral infection. Immunity, 9(6):871–880, 1998.

[150] W. Lu, S. Y. Lo, M. Chen, K. J. Wu, Y. K. Fung, and J. H. Ou. Activation of p53 tumor suppressor by hepatitis C virus core protein. Virology, 264(1):134–141, 1999.

[151] X. Lu, V. V. Jain, P. W. Finn, and D. L. Perkins. Hubs in biological interaction networks exhibit low changes in expression in experimental asthma. Mol Syst Biol, 3:98, 2007.

[152] J. Ma and M. Ptashne. A new class of yeast transcriptional activators. Cell, 51(1):113–119, 1987.

[153] E. M. Marcotte, M. Pellegrini, H. L. Ng, D. W. Rice, T. O. Yeates, and D. Eisenberg. Detecting protein function and protein-protein interactions from genome sequences. Science, 285(5428):751–753, 1999.

[154] L. R. Matthews, P. Vaglio, J. Reboul, H. Ge, B. P. Davis, J. Garrels, S. Vincent, and M. Vidal. Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”. Genome Res, 11(12):2120–2126, 2001.

[155] J. E. McInturff, R. L. Modlin, and J. Kim. The role of toll-like receptors in the pathogenesis and treatment of dermatological disease. J Invest Dermatol, 125(1):1–8, 2005.

[156] R. Medzhitov, P. Preston-Hurlburt, and C. A. J. Janeway. A human homologue of the Drosophila Toll protein signals activation of adaptive immunity. Nature, 388(6640):394–397, 1997.

[157] A. Meier, G. Alter, N. Frahm, H. Sidhu, B. Li, A. Bagchi, N. Teigen, H. Streeck, H. Stellbrink, J. Hellman, J. van Lunzen, and M. Altfeld. MyD88-dependent immune activation mediated by HIV-1-encoded TLR ligands. J Virol, 2007.

[158] H. W. Mewes, D. Frishman, U. Guldener, G. Mannhaupt, K. Mayer, M. Mokrejs, B. Morgenstern, M. Munsterkotter, S. Rudd, and B. Weil. MIPS: a database for genomes and protein sequences. Nucleic Acids Res, 30(1):31–34, 2002.

[159] G. R. Mishra, M. Suresh, K. Kumaran, N. Kannabiran, S. Suresh, P. Bala, K. Shivakumar, N. Anuradha, R. Reddy, T. M. Raghavan, S. Menon, G. Hanumanthu, M. Gupta, S. Upendran, S. Gupta, M. Mahesh, B. Jacob, P. Mathew, P. Chatterjee, K. S. Arun, S. Sharma, K. N. Chandrika, N. Deshpande, K. Palvankar, R. Raghavnath, R. Krishnakanth, H. Karathia, B. Rekha, R. Nayak, G. Vishnupriya, H. G. M. Kumar, M. Nagini, G. S. S. Kumar, R. Jose, P. Deepthi, S. S. Mohan, T. K. B. Gandhi, H. C. Harsha, K. S. Deshpande, M. Sarker, T. S. K. Prasad, and A. Pandey. Human protein reference database–2006 update. Nucleic Acids Res, 34(Database issue):D411–4, 2006.

[160] S. B. Mizel, A. N. Honko, M. A. Moors, P. S. Smith, and A. P. West. Induction of macrophage nitric oxide production by Gram-negative flagellin involves signaling via heteromeric Toll-like receptor 5/Toll-like receptor 4 complexes. J Immunol, 170(12):6217–6223, 2003.

[161] T. H. Mogensen, S. R. Paludan, M. Kilian, and L. Ostergaard. Live Streptococcus pneumoniae, Haemophilus influenzae, and Neisseria meningitidis activate the inflammatory response through Toll-like receptors 2, 4, and 9 in species-specific patterns. J Leukoc Biol, 80(2):267–277, 2006.

[162] A. Muller, A. Ritzkowsky, and G. Steger. Cooperative activation of human papillomavirus type 8 gene expression by the E2 protein and the cellular coactivator p300. J Virol, 76(21):11042–11053, 2002.

[163] L. M. Nabell, R. H. Raja, P. P. Sayeski, A. J. Paterson, and J. E. Kudlow. Human immunodeficiency virus 1 Tat stimulates transcription of the transforming growth factor alpha gene in an epidermal growth factor-dependent manner. Cell Growth Differ, 5(1):87–93, 1994.

[164] A. Nakanishi, D. Shum, H. Morioka, E. Otsuka, and H. Kasamatsu. Interaction of the Vp3 nuclear localization signal with the importin alpha 2/beta heterodimer directs nuclear entry of infecting simian virus 40. J Virol, 76(18):9368–9377, 2002.

[165] M. Narayanan and R. M. Karp. Comparing protein interaction networks via a graph match-and-split algorithm. J Comput Biol, 14(7):892–907, 2007.

[166] L. M. Nelson, R. C. Rose, and J. Moroianu. The L1 major capsid protein of human papillomavirus type 11 interacts with Kap β2 and Kap β3 nuclear import receptors. Virology, 306(1):162–169, 2003.

[167] S. K. Ng, Z. Zhang, and S. H. Tan. Integrative approach for computationally inferring protein domain interactions. Bioinformatics, 19(8):923–929, 2003.

[168] C. F. Ockenhouse, W. C. Hu, K. E. Kester, J. F. Cummings, A. Stewart, D. G. Heppner, A. E. Jedlicka, A. L. Scott, N. D. Wolfe, M. Vahey, and D. S. Burke. Common and divergent immune response signaling pathways discovered in peripheral blood mononuclear cell gene expression patterns in presymptomatic and clinically apparent malaria. Infect Immun, 74(10):5561–5573, 2006.

[169] R. O’Donnell and M. J. Blackman. The role of malaria merozoite proteases in red blood cell invasion. Curr Opin Microbiol, 8(4):422–427, 2005.

[170] M. Otsuka, N. Kato, K. Lan, H. Yoshida, J. Kato, T. Goto, Y. Shiratori, and M. Omata. Hepatitis C virus core protein enhances p53 function through augmentation of DNA binding affinity and transcriptional ability. J Biol Chem, 275(44):34122–34130, 2000.

[171] J.-R. Pallandre, E. Brillard, G. Crehange, A. Radlovic, J.-P. Remy-Martin, P. Saas, P.-S. Rohrlich, X. Pivot, X. Ling, P. Tiberghien, and C. Borg. Role of STAT3 in CD4+CD25+FOXP3+ regulatory lymphocyte generation: implications in graft-versus-host disease and antitumor immunity. J Immunol, 179(11):7593–7604, 2007.

[172] J. M. Park, F. R. Greten, Z.-W. Li, and M. Karin. Macrophage apoptosis by anthrax lethal factor through p38 MAP kinase inhibition. Science, 297(5589):2048–2051, 2002.

[173] D. Patel, S. M. Huang, L. A. Baglia, and D. J. McCance. The E6 protein of human papillomavirus type 16 binds to and inhibits co-activation by CBP and p300. EMBO J, 18(18):5061–5072, 1999.

[174] T. Pawson and P. Nash. Assembly of cell regulatory systems through protein interaction domains. Science, 300(5618):445–452, 2003.

[175] F. Pazos and A. Valencia. Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng, 14(9):609–614, 2001.

[176] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A, 96(8):4285–4288, 1999.

[177] K. Petersson, M. Thunnissen, G. Forsberg, and B. Walse. Crystal structure of a SEA variant in complex with MHC class II reveals the ability of SEA to crosslink MHC molecules. Structure, 10(12):1619–1626, 2002.

[178] K. S. Pfennig. Evolution of pathogen virulence: the role of variation in host phenotype. Proc Biol Sci, 268(1468):755–760, 2001.

[179] S. G. Popov, T. G. Popova, E. Grene, F. Klotz, J. Cardwell, C. Bradburne, Y. Jama, M. Maland, J. Wells, A. Nalca, T. Voss, C. Bailey, and K. Alibek. Systemic cytokine response in murine anthrax. Cell Microbiol, 6(3):225–233, 2004.

[180] D. L. Poulin, A. L. Kung, and J. A. DeCaprio. p53 targets simian virus 40 large T antigen for acetylation by CBP. J Virol, 78(15):8245–8253, 2004.

[181] M. V. Prasad and G. Shanmugam. Retinoblastoma gene inhibits transactivation of HIV-LTR linked gene expression upon co-transfection in HeLa cells. Biochem Mol Biol Int, 29(1):57–62, 1993.

[182] A. Prelic, S. Bleuler, P. Zimmermann, A. Wille, P. Buhlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122–1129, 2006.

[183] M. J. Pryor, S. M. Rawlinson, R. E. Butcher, C. L. Barton, T. A. Waterhouse, S. G. Vasudevan, P. G. Bardin, P. J. Wright, D. A. Jans, and A. D. Davidson. Nuclear localization of dengue virus nonstructural protein 5 through its importin alpha/beta-recognized nuclear localization sequences is integral to viral infection. Traffic, 8(7):795–807, 2007.

[184] T. Punga and G. Akusjarvi. The adenovirus-2 E1B-55K protein interacts with a mSin3A/histone deacetylase 1 complex. FEBS Lett, 476(3):248–252, 2000.

[185] T. Punga and G. Akusjarvi. Adenovirus 2 E1B-55K protein relieves p53-mediated transcriptional repression of the survivin and MAP4 promoters. FEBS Lett, 552(2-3):214–218, 2003. In Vitro.

[186] Y. Qi, Z. Bar-Joseph, and J. Klein-Seetharaman. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins, 63(3):490–500, 2006.

[187] E. Quevillon, V. Silventoinen, S. Pillai, N. Harte, N. Mulder, R. Apweiler, and R. Lopez. InterProScan: protein domains identifier. Nucleic Acids Res, 33(Web Server issue):W116–20, 2005.

[188] J. Rachlin, D. D. Cohen, C. Cantor, and S. Kasif. Biological context networks: a mosaic view of the interactome. Mol Syst Biol, 2:66, 2006.

[189] M. Reichelt, K. D. Zang, M. Seifert, C. Welter, and T. Ruffing. The yeast two-hybrid system reveals no interaction between p73 alpha and SV40 large T-antigen. Arch Virol, 144(3):621–626, 1999.

[190] M. Remm, C. E. Storm, and E. L. Sonnhammer. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol, 314(5):1041–1052, 2001.

[191] C. Rivera. Identifying evolutionarily conserved protein interaction networks. Master’s thesis, Virginia Polytechnic and State University, 2005. Available at http://scholar.lib.vt.edu/theses/available/etd-06162005-140525.

[192] E. Roggwiller, A. C. Fricaud, T. Blisnick, and C. Braun-Brenton. Host urokinase-type plasminogen activator participates in the release of malaria merozoites from infected erythrocytes. Mol Biochem Parasitol, 86(1):49–59, 1997.

[193] J. A. Rowe, J. M. Moulds, C. I. Newbold, and L. H. Miller. P. falciparum rosetting mediated by a parasite-variant erythrocyte membrane protein and complement-receptor 1. Nature, 388(6639):292–295, 1997.

[194] J.-F. Rual, K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, G. F. Berriz, F. D. Gibbons, M. Dreze, N. Ayivi-Guedehoussou, N. Klitgord, C. Simon, M. Boxem, S. Milstein, J. Rosenberg, D. S. Goldberg, L. V. Zhang, S. L. Wong, G. Franklin, S. Li, J. S. Albala, J. Lim, C. Fraughton, E. Llamosas, S. Cevik, C. Bex, P. Lamesch, R. S. Sikorski, J. Vandenhaute, H. Y. Zoghbi, A. Smolyar, S. Bosak, R. Sequerra, L. Doucette-Stamm, M. E. Cusick, D. E. Hill, F. P. Roth, and M. Vidal. Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437(7062):1173–1178, Oct 2005.

[195] R. Ruediger, N. Brewis, K. Ohst, and G. Walter. Increasing the ratio of PP2A core enzyme to holoenzyme inhibits Tat-stimulated HIV-1 transcription and virus production. Virology, 238(2):432–443, 1997.

[196] L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, and D. Eisenberg. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res, 32:D449–51, 2004.

[197] H. Sanjo, T. Kawai, and S. Akira. DRAKs, novel serine/threonine kinases related to death-associated protein kinase that trigger apoptosis. J Biol Chem, 273(44):29066–29071, 1998.

[198] Q. J. Sattentau, A. G. Dalgleish, R. A. Weiss, and P. C. Beverley. Epitopes of the CD4 antigen and HIV infection. Science, 234(4780):1120–1123, 1986.

[199] B. E. Sawaya, K. Khalili, J. Gordon, R. Taube, and S. Amini. Cooperative interaction between HIV-1 regulatory proteins Tat and Vpr modulates transcription of the viral genome. J Biol Chem, 275(45):35209–35214, 2000.

[200] M. Seeger, K. Ferrell, R. Frank, and W. Dubiel. HIV-1 Tat inhibits the 20 S proteasome and its 11 S regulator-mediated activation. J Biol Chem, 272(13):8145–8148, 1997.

[201] P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, 13(11):2498–2504, 2003.

[202] R. Sharan and T. Ideker. Modeling cellular machinery through biological network comparison. Nat Biotechnol, 24(4):427–433, 2006.

[203] R. Sharan, T. Ideker, B. P. Kelley, R. Shamir, and R. M. Karp. Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data. In RECOMB ’04: Proceedings of the eighth annual international conference on Computational molecular biology, pages 282–289, New York, NY, USA, 2004. ACM Press.

[204] R. Sharan, S. Suthram, R. M. Kelley, T. Kuhn, S. McCuine, P. Uetz, T. Sittler, R. M. Karp, and T. Ideker. Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci U S A, 102(6):1974–1979, 2005.

[205] R. Sharan, I. Ulitsky, and R. Shamir. Network-based prediction of protein function. Mol Syst Biol, 3:88, 2007.

[206] J. Shen, J. Zhang, X. Luo, W. Zhu, K. Yu, K. Chen, Y. Li, and H. Jiang. Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci U S A, 104(11):4337–4341, 2007.

[207] A. J. Sinclair, M. Fenton, and S. Delikat. Interactions between Epstein-Barr virus and the cell cycle control machinery. Histol Histopathol, 13(2):461–467, 1998.

[208] C. Soderberg-Naucler. Does cytomegalovirus play a causative role in the development of various inflammatory diseases and cancer? J Intern Med, 259(3):219–46, 2006.

[209] S. Southern and C. Herrington. Disruption of cell cycle control by human papillomaviruses with special reference to cervical carcinoma. Int J Gynecol Cancer, 10(4):263–274, 2000.

[210] D. Spitkovsky, F. Aengeneyndt, J. Braspenning, and M. von Knebel Doeberitz. p53-independent growth regulation of cervical cancer cells by the papillomavirus E6 oncogene. Oncogene, 13(5):1027–1035, 1996.

[211] E. Sprinzak and H. Margalit. Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol, 311(4):681–692, 2001.

[212] C. Staib, J. Pesch, R. Gerwig, J. K. Gerber, U. Brehm, A. Stangl, and F. Grummt. p53 inhibits JC virus DNA replication in vivo and interacts with JC virus large T-antigen. Virology, 219(1):237–246, 1996.

[213] M. N. Starnbach and M. J. Bevan. Cells infected with Yersinia present an epitope to class I MHC-restricted CTL. J Immunol, 153(4):1603–1612, 1994.

[214] U. Stelzl, U. Worm, M. Lalowski, C. Haenig, F. H. Brembeck, H. Goehler, M. Stroedicke, M. Zenkner, A. Schoenherr, S. Koeppen, J. Timm, S. Mintzlaff, C. Abraham, N. Bock, S. Kietzmann, A. Goedde, E. Toksoz, A. Droege, S. Krobitsch, B. Korn, W. Birchmeier, H. Lehrach, and E. E. Wanker. A human protein-protein interaction network: a resource for annotating the proteome. Cell, 122(6):957–968, Sep 2005.

[215] A. M. Stock, V. L. Robinson, and P. N. Goudreau. Two-component signal transduction. Annu Rev Biochem, 69:183–215, 2000.

[216] C. J. Stoeckert Jr., S. Fischer, J. C. Kissinger, M. Heiges, C. Aurrecoechea, B. Gajria, and D. S. Roos. PlasmoDB v5: new looks, new genomes. Trends Parasitol, 22(12):543–546, 2006.

[217] J. Stubbs, K. M. Simpson, T. Triglia, D. Plouffe, C. J. Tonkin, M. T. Duraisingh, A. G. Maier, E. A. Winzeler, and A. F. Cowman. Molecular mechanism for switching of P. falciparum invasion pathways into human erythrocytes. Science, 309(5739):1384–1387, 2005.

[218] M. P. H. Stumpf, C. Wiuf, and R. M. May. Subnets of scale-free networks are not scale-free: sampling properties of networks. Proc Natl Acad Sci U S A, 102(12):4221–4224, 2005.

[219] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102:15545–15550, 2005.

[220] S. Suthram, T. Shlomi, E. Ruppin, R. Sharan, and T. Ideker. A direct comparison of protein interaction confidence assignment schemes. BMC Bioinformatics, 7:360, 2006.

[221] T. Swigut, N. Shohdy, and J. Skowronski. Mechanism for down-regulation of CD28 by Nef. EMBO J, 20(7):1593–1604, 2001.

[222] J. Tamames, G. Casari, C. Ouzounis, and A. Valencia. Conserved clusters of functionally related genes in two bacterial genomes. J Mol Evol, 44(1):66–73, 1997.

[223] S.-L. Tan, G. Ganji, B. Paeper, S. Proll, and M. G. Katze. Systems biology and the host response to viral infection. Nat Biotechnol, 25(12):1383–1389, 2007.

[224] M. Tanaka, T. Ueno, T. Nakahara, K. Sasaki, A. Ishimoto, and H. Sakai. Downregulation of CD4 is required for maintenance of viral infectivity of HIV-1. Virology, 311(2):316–325, 2003.

[225] R. L. Tatusov, M. Y. Galperin, D. A. Natale, and E. V. Koonin. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res, 28(1):33–36, 2000.

[226] L. H. Taylor, M. J. Mackinnon, and A. F. Read. Virulence of mixed-clone and single-clone infections of the rodent malaria Plasmodium chabaudi. Evolution, 52:583–591, 1998.

[227] K. Tewari, J. Walent, J. Svaren, R. Zamoyska, and M. Suresh. Differential requirement for Lck during primary and memory CD8+ T cell responses. Proc Natl Acad Sci U S A, 103(44):16388–16393, 2006.

[228] M. Thomas, P. Massimi, C. Navarro, J.-P. Borg, and L. Banks. The hScrib/Dlg apico-basal control complex is differentially targeted by HPV-16 and HPV-18 E6 proteins. Oncogene, 24(41):6222–6230, 2005.

[229] D. A. Thompson, G. Belinsky, T. H. Chang, D. L. Jones, R. Schlegel, and K. Munger. The human papillomavirus-16 E6 oncoprotein decreases the vigilance of mitotic checkpoints. Oncogene, 15(25):3025–3035, 1997.

[230] J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22(22):4673–4680, 1994.

[231] R. Truant and B. R. Cullen. The arginine-rich domains present in human immunodeficiency virus type 1 Tat and Rev function as direct importin beta-dependent nuclear localization signals. Mol Cell Biol, 19(2):1210–1217, 1999. In Vitro.

[232] P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, D. Lockshon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili, Y. Li, B. Godwin, D. Conover, T. Kalbfleisch, G. Vijayadamodar, M. Yang, M. Johnston, S. Fields, and J. M. Rothberg. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403(6770):623–627, Feb 2000.

[233] I. Ulitsky and R. Shamir. Identification of functional modules using network topology and high-throughput data. BMC Syst Biol, 1:8, 2007.

[234] R. R. Valiathan and M. D. Resh. Expression of human immunodeficiency virus type 1 gag modulates ligand-induced downregulation of EGF receptor. J Virol, 78(22):12386–12394, 2004.

[235] A. C. Vendel and K. J. Lumb. Molecular recognition of the human coactivator CBP by the HIV-1 transcriptional activator Tat. Biochemistry, 42(4):910–916, 2003.

[236] N. J. Venkatachari, B. Majumder, and V. Ayyavoo. Human immunodeficiency virus (HIV) type 1 Vpr induces differential regulation of T cell costimulatory molecules: direct effect of Vpr on T cell activation and immune function. Virology, 358(2):347–356, 2007.

[237] X. W. Wang, M. K. Gibson, W. Vermeulen, H. Yeh, K. Forrester, H. W. Sturzbecher, J. H. Hoeijmakers, and C. C. Harris. Abrogation of p53-induced apoptosis by the hepatitis B virus X gene. Cancer Res, 55(24):6012–6016, 1995.

[238] A. Watari and M. Yutsudo. Multi-functional gene asy/nogo/rtn-x/rtn4: apoptosis, tumor suppression, and inhibition of neuronal regeneration. Apoptosis, 8(1):5–9, 2003.

[239] A. C. Wilson, S. S. Carlson, and T. J. White. Biochemical evolution. Annu Rev Biochem, 46:573–639, 1977.

[240] C. Withers-Martinez, L. Jean, and M. Blackman. Subtilisin-like proteases of the malaria parasite. Mol Microbiol, 53(1):55–63, 2004.

[241] R. B. Yang, M. R. Mark, A. Gray, A. Huang, M. H. Xie, M. Zhang, A. Goddard, W. I. Wood, A. L. Gurney, and P. J. Godowski. Toll-like receptor-2 mediates lipopolysaccharide-induced cellular signalling. Nature, 395(6699):284–288, 1998.

[242] M. A. Yildirim, K.-I. Goh, M. E. Cusick, A.-L. Barabasi, and M. Vidal. Drug-target network. Nat Biotechnol, 25(10):1119–1126, 2007.

[243] H. Yu, P. M. Kim, E. Sprecher, V. Trifonov, and M. Gerstein. The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Computational Biology, 3(4):e59, 2007.

[244] H. Yu, N. M. Luscombe, H. X. Lu, X. Zhu, Y. Xia, J. J. Han, N. Bertin, S. Chung, M. Vidal, and M. Gerstein. Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res, 14:1107–1118, 2004.

[245] H. Yu, A. Paccanaro, V. Trifonov, and M. Gerstein. Predicting interactions in protein networks by completing defective cliques. Bioinformatics, 22(7):823–829, 2006.

[246] A. Zanzoni, L. Montecchi-Palazzi, M. Quondam, G. Ausiello, M. Helmer-Citterich, and G. Cesareni. MINT: a Molecular INTeraction database. FEBS Lett, 513:135–140, 2002.

[247] L. Zhang, S. Wong, O. King, and F. Roth. Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics, 5:38, 2004.

[248] Y. Zhang and J. B. Bliska. Role of toll-like receptor signaling in the apoptotic response of macrophages to Yersinia infection. Infect Immun, 71(3):1513–1519, 2003.

[249] Y. Zhang and J. B. Bliska. Role of macrophage apoptosis in the pathogenesis of Yersinia. Curr Top Microbiol Immunol, 289:151–173, 2005.

[250] Y. Zhao, Z. Li, S. J. Drozd, Y. Guo, W. Mourad, and H. Li. Crystal structure of Mycoplasma arthritidis mitogen complexed with HLA-DR1 reveals a novel superantigen fold and a dimerized superantigen-MHC complex. Structure, 12(2):277–288, 2004.

[251] C. E. Zhou, J. Smith, M. Lam, A. Zemla, M. D. Dyer, and T. Slezak. MvirDB–a microbial database of protein toxins, virulence factors and antibiotic resistance genes for bio-defence applications. Nucleic Acids Res, 35(Database issue):D391–4, 2007.

[252] Y. Zhu, M. Roshal, F. Li, J. Blackett, and V. Planelles. Upregulation of survivin by HIV-1 Vpr. Apoptosis, 8(1):71–79, 2003.
