

Controllability analysis of the directed human protein interaction network identifies disease genes and drug targets

Arunachalam Vinayagam a,1, Travis E. Gibson b, Ho-Joon Lee c,2, Bahar Yilmazel d,e, Charles Roesel d,e,3, Yanhui Hu a,d, Young Kwon a, Amitabh Sharma b,f,g, Yang-Yu Liu b,f,g,1, Norbert Perrimon a,h,1, and Albert-László Barabási f,g,1

a Department of Genetics, Harvard Medical School, Boston, MA 02115; b Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115; c Department of Systems Biology, Harvard Medical School, Boston, MA 02115; d Drosophila RNAi Screening Center, Department of Genetics, Harvard Medical School, Boston, MA 02115; e Bioinformatics Program, Northeastern University, Boston, MA 02115; f Center for Complex Network Research, Department of Physics, Northeastern University, Boston, MA 02115; g Center for Cancer Systems Biology, Dana-Farber Cancer Institute, Boston, MA 02115; and h Howard Hughes Medical Institute, Harvard Medical School, MA 02115

Contributed by Norbert Perrimon, March 24, 2016 (sent for review November 25, 2015; reviewed by Reka Albert and Tatsuya Akutsu)

The protein–protein interaction (PPI) network is crucial for cellular information processing and decision-making. With suitable inputs, PPI networks drive the cells to diverse functional outcomes such as cell proliferation or cell death. Here, we characterize the structural controllability of a large directed human PPI network comprising 6,339 proteins and 34,813 interactions. This network allows us to classify proteins as “indispensable,” “neutral,” or “dispensable,” according to whether removing the protein increases, does not change, or decreases the number of driver nodes in the network. We find that 21% of the proteins in the PPI network are indispensable. Interestingly, these indispensable proteins are the primary targets of disease-causing mutations, human viruses, and drugs, suggesting that altering a network’s control property is critical for the transition between healthy and disease states. Furthermore, analyzing copy number alteration data from 1,547 cancer patients reveals that 56 genes that are frequently amplified or deleted in nine different cancers are indispensable. Of these 56 genes, 46 have not been previously associated with cancer. This suggests that controllability analysis is useful for identifying novel disease genes and potential drug targets.

network biology | controllability | protein–protein interaction network | disease genes | drug targets

Significance

Large-scale biological network analyses often use concepts used in social networks analysis (e.g. finding “communities,” “hubs,” etc.). However, mathematically advanced engineering concepts have only been applied to analyze small and well-characterized networks so far in biology. Here, we applied a sophisticated engineering tool, from control theory, to analyze a large-scale directed human protein–protein interaction network. Our analysis revealed that the proteins that are indispensable, from a network controllability perspective, are also commonly targeted by disease-causing mutations and human viruses or have been identified as drug targets. Furthermore, we used the controllability analysis to prioritize novel cancer genes from cancer genomic datasets. Altogether, we demonstrated an application of network controllability analysis to identify new disease genes and drug targets.

Author contributions: A.V., Y.-Y.L., N.P., and A.-L.B. designed research; A.V., T.E.G., and Y.-Y.L. performed research; H.-J.L., B.Y., C.R., Y.H., Y.K., and A.S. contributed new reagents/analytic tools; A.V., T.E.G., and Y.-Y.L. analyzed data; and A.V., T.E.G., Y.-Y.L., and A.-L.B. wrote the paper.

Reviewers: R.A., Pennsylvania State University; T.A., Kyoto University.

The authors declare no conflict of interest.

1 To whom correspondence may be addressed. Email: [email protected], [email protected], [email protected], or [email protected].

2 Present address: Department of Laboratory Medicine, Gangnam Severance Hospital, Yonsei University College of Medicine, Seoul, 062-73, Korea.

3 Present address: Marine Science Center, Northeastern University, Nahant, MA 01908.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1603992113/-/DCSupplemental.

4976–4981 | PNAS | May 3, 2016 | vol. 113 | no. 18 | www.pnas.org/cgi/doi/10.1073/pnas.1603992113

The need to control engineered systems has resulted in a mathematically rich set of tools that are widely applied in the design of electric circuits, manufacturing processes, communication systems, aircraft, spacecraft, and robots (1–3). Control theory deals with the design and stability analysis of dynamic systems that receive information via inputs and have outputs available for measurement. Issues of control and regulation are central to the study of biological systems (4, 5), which sense and process both external and internal cues using a network of interacting molecules (6). The dynamic regulation of this molecular network in turn drives the system to various functional states, such as triggering cell proliferation or inducing apoptosis. This feature of specific input signals driving networks from an initial state to a specific functional state suggests that the need to control a biological system plays a potentially important role in the evolution of molecular interaction networks. Note that the term “state” is also used in a control context where the “state space” of a control system is the space of values the “state variables” can attain. For a protein–protein interaction (PPI) network, the state variables are the specific protein concentrations and the state space is all positive real numbers of dimension equal to the total number of proteins in the PPI network.

According to control theory, a dynamic system is controllable if, with a suitable choice of inputs, the system can be driven from any initial state to any desired final state in finite time (2, 7).

Previous studies have shown that network components exhibit properties of control systems such as proportional action, feedback control, and feed-forward control (8–12). However, the main challenges that hinder systematic controllability analysis of biological networks are the availability of large-scale biologically relevant networks and efficient tools to analyze their controllability. To address these issues, two resources were integrated in this work: (i) a directed human PPI network (13); and (ii) an analytical framework to characterize the structural controllability of directed weighted networks (14). The directed human PPI network represents a global snapshot of the information flow in cell signaling. For a given weighted and directed network associated with linear time-invariant dynamics, the analytical framework identifies a minimum set of driver nodes, whose control is sufficient to fully control the dynamics of the whole network (14, 15).

In this work, we classified the proteins (nodes) as “indispensable,” “neutral,” or “dispensable,” based on the change of the minimum number of driver nodes needed to control the PPI network when a specific protein (node) is absent. In addition, we analyzed the role of different node types in the context of human diseases. Using known examples of disease-causing mutations, virus targets, and drug targets, we identified indispensable nodes that are key players in mediating the transition between healthy and disease states. Our study illustrates the potential application of network controllability analysis as a powerful tool to identify new disease genes.
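For reference, the controllability notion used in the introduction can be written out for linear time-invariant dynamics. The block below is a sketch of the standard textbook formulation (Kalman's rank condition), not an equation taken from this paper; the symbols x, u, A, B, C, N, and M are introduced here for illustration only.

```latex
% x(t) in R^N: state vector (one concentration per protein)
% u(t) in R^M: external input signals
% A (N x N):   weighted, directed interaction matrix
% B (N x M):   input matrix, marking the nodes that receive inputs
\begin{align}
  \dot{x}(t) &= A\,x(t) + B\,u(t), \\
  C &= \bigl[\, B,\; AB,\; A^{2}B,\; \dots,\; A^{N-1}B \,\bigr], \\
  \text{the system is controllable} &\iff \operatorname{rank} C = N .
\end{align}
```

Structural controllability asks whether this rank condition holds for almost all choices of the nonzero entries of A and B, so the answer depends only on which edges exist and not on their exact weights.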

Results

Characterizing the Controllability of the Directed PPI Network. We applied linear control tools to assess local controllability of PPI networks whose dynamics are inherently nonlinear. The experimentally obtained network, however, can be assumed to capture linear effects around homeostasis. Furthermore, given that the tools developed in ref. 14 are for linear dynamics, we are careful to only assume that we can ascertain local controllability around homeostasis. Controllability henceforth refers to local controllability (see SI Text for details).

The directed human PPI network consists of 6,339 proteins (nodes) and 34,813 directed edges, where the edge direction corresponds to the hierarchy of signal flow between the interacting proteins and the edge weight corresponds to the confidence of the predicted direction. We applied structural controllability theory to identify a minimum set of driver nodes (i.e., nodes through which we can achieve control of the whole network). Note that the identified minimum driver node set (MDS) is not unique, but its size, denoted as N_D, is uniquely determined by the network topology. We found that the MDS of the directed human PPI network contains 36% of nodes. We also classified the nodes as indispensable, neutral, or dispensable, based on the change of N_D upon their removal. A node is (i) indispensable if removing it increases N_D (e.g., node 2 in Fig. 1A), (ii) neutral if its removal has no effect on N_D (e.g., node 1 in Fig. 1A), and (iii) dispensable if its removal reduces N_D (e.g., nodes 3 and 4 in Fig. 1A). In the directed human PPI network, 21% of nodes are indispensable, 42% are neutral, and the remaining 37% are dispensable (Fig. 1B). Interestingly, we found that all three node types have a heterogeneous degree distribution, and indispensable nodes tend to have higher in- and out-degrees compared with neutral and dispensable nodes (Fig. 1 C and D). Similarly, indispensable nodes are associated with more PubMed records (www.ncbi.nlm.nih.gov/pubmed) and Gene Ontology (16) term annotations than neutral and dispensable nodes (Fig. S1 A and B). However, the correlation between the node degree and the literature bias is weak (correlation coefficient of 0.37 and 0.41 for in- and out-degree, respectively), suggesting that the higher degree of indispensable nodes is not explained by the literature bias alone (Fig. S1 C and D).

We characterized indispensable, neutral, and dispensable nodes in the context of essentiality, evolutionary conservation, and regulation at the level of translational and posttranslational modifications (PTMs). Our gene essentiality analysis indicated that indispensable nodes are enriched in essential genes, whereas essential genes are underrepresented among dispensable nodes (Fig. 1E, Fig. S1E, and Dataset S1). Furthermore, indispensable nodes are more evolutionarily conserved from human to yeast than the other two node types (Fig. 1E and Fig. S1F). Next, we analyzed the different node types in the context of cell signaling, which is at the core of cellular information processing. In general, known signaling proteins are enriched as indispensable nodes. However, dissecting different functional classes within signaling proteins reveals that kinases are enriched as indispensable nodes whereas membrane receptors and transcription factors are enriched as neutral nodes (Fig. 1E and Fig. S2A). Analysis of the protein steady-state abundance in cell lines, as a measure of translational regulation, reveals that indispensable nodes are enriched as high-copy number proteins, whereas low-copy number proteins show moderate enrichments for both indispensable and dispensable nodes (Fig. 1E and Fig. S2B). Similarly, indispensable nodes are highly regulated through PTM, including acetylation, ubiquitination, and phosphorylation (pS/pT and pY) (Fig. 1E and Fig. S2C). Altogether, our enrichment analyses revealed distinct functional and regulatory roles for indispensable, neutral, and dispensable nodes.

[Figure 1, panels A–E; image not reproduced. Panel A: four-node example network with |MDS| = 2 for the intact network and |MDS| = 2, 3, 1, and 1 after removing node 1, 2, 3, and 4, respectively. Panel B: human directed PPI network (6,339 nodes, 34,813 edges), with node type, number of nodes, and effect of node removal:
  Indispensable: 1,330 (20.9%), increases number of driver nodes
  Neutral: 2,662 (41.9%), no change in number of driver nodes
  Dispensable: 2,347 (37.1%), decreases number of driver nodes
Panels C and D: p(k) versus in-degree and out-degree, with the average degree for each node type. Panel E: z scores (scale −5 to 5) for dispensable, neutral, and indispensable nodes across essential genes, evolutionary conservation (Mus musculus, Danio rerio, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae), cell signaling (signaling proteins, membrane receptors, protein kinases, transcription factors), protein abundance (high, medium, low, and very low copy number), and PTM (acetylation, phosphorylation pS/pT and pY, ubiquitination).]

Fig. 1. Characterizing the controllability of human directed PPI network. (A) Schematic representation of the node classification using controllability framework. (B) Identification of indispensable, neutral, and dispensable nodes in human directed PPI network. (C) In-degree distribution and average in-degree for three different node types. (D) Out-degree distribution and average out-degree for three different node types. (E) Distinct enrichment profiles of indispensable, neutral, and dispensable nodes in the context of essential genes, evolutionary conservation, cell signaling, protein abundance, and PTMs.
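The node classification described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the maximum-matching formulation of structural controllability, in which N_D = max(N − |M*|, 1) for a maximum matching M* of the directed network, and all function names are hypothetical. The four-node graph is an invented example chosen to show one node of each kind; it is not claimed to be the topology drawn in Fig. 1A.

```python
def max_matching_size(nodes, edges):
    """Size of a maximum matching of a directed graph, found with
    augmenting paths on its bipartite (out-copy, in-copy) representation."""
    succ = {u: [] for u in nodes}
    for u, v in edges:
        if u in succ and v in succ:
            succ[u].append(v)
    matched_to = {}  # in-copy v -> out-copy u currently matched to it

    def augment(u, seen):
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in matched_to or augment(matched_to[v], seen):
                matched_to[v] = u
                return True
        return False

    # Recursive search: fine for toy graphs, too deep for very large ones.
    return sum(augment(u, set()) for u in nodes)


def n_driver(nodes, edges):
    """Minimum number of driver nodes, assuming N_D = max(N - |M*|, 1)."""
    nodes = list(nodes)
    if not nodes:
        return 0
    return max(len(nodes) - max_matching_size(nodes, edges), 1)


def classify(nodes, edges):
    """Label each node by how its removal changes N_D."""
    nodes = list(nodes)
    base = n_driver(nodes, edges)
    labels = {}
    for x in nodes:
        rest = [n for n in nodes if n != x]
        kept = [(u, v) for u, v in edges if x not in (u, v)]
        nd = n_driver(rest, kept)
        if nd > base:
            labels[x] = "indispensable"
        elif nd < base:
            labels[x] = "dispensable"
        else:
            labels[x] = "neutral"
    return labels


# Hypothetical toy graph: 1 -> 2, 2 -> 3, 2 -> 4
toy_nodes = [1, 2, 3, 4]
toy_edges = [(1, 2), (2, 3), (2, 4)]
print(n_driver(toy_nodes, toy_edges))
print(classify(toy_nodes, toy_edges))
```

Recomputing the matching from scratch for every removed node keeps the sketch short; for a network of several thousand nodes, an iterative matching algorithm such as Hopcroft–Karp would be the more practical choice.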

Understanding Healthy to Disease State Transition Using Network Controllability. We analyzed the node classification in the context of driving the system from healthy to disease condition and vice versa. Specifically, we analyzed the impact of three different transitions: (i) healthy to disease transition induced by mutations or other genetic alterations; (ii) healthy to infectious transition induced by human viruses; and (iii) disease to healthy transition induced by drugs or small molecules. Note that our goal is to determine whether specific node types (indispensable, neutral, or dispensable) are enriched for (i) disease-causing mutations, (ii) targets of human viruses, and (iii) drug targets.

First, we analyzed 445 genes annotated by the Sanger Center as causally implicated in oncogenesis (Cancer Gene Census; cancer.sanger.ac.uk/census) (17). Interestingly, we found that indispensable nodes are highly enriched in cancer genes, whereas neutral nodes showed no enrichment and dispensable nodes are underrepresented (Fig. 2A, Cancer I; Fig. S3A; and Dataset S2). To ensure that the observed enrichment of indispensable nodes is not attributable to the literature and degree bias, we repeated our analysis using literature- and degree-controlled random sets (SI Text). After adjusting for literature and degree bias (Fig. 2A, PubMed and Degree; and Dataset S2), indispensable nodes remain significantly enriched as cancer genes. Note that for the enrichment analyses below, the degree- and literature-controlled enrichment results are shown in Fig. S3B. To further substantiate that indispensable nodes are enriched as cancer genes, we analyzed 3,164 genes predicted as cancer-related genes (18) and observed a similar enrichment for indispensable nodes (Fig. 2B, Cancer II; and Fig. S3A).

Next, we analyzed 1,403 genes annotated by Online Mendelian Inheritance in Man (OMIM) (omim.org) as causal genes for various genetic diseases, aiming to test whether the perturbation of indispensable nodes is a specific feature of cancer or a general feature of human diseases. Our analysis showed that the perturbation of indispensable nodes is a common feature of human diseases (Fig. 2B, OMIM; and Fig. S3A). Interestingly, however, our analysis of disease genes identified from genome-wide association studies (GWAS) (www.genome.gov/gwastudies) (19) revealed poor enrichment for indispensable nodes (Fig. 2B, GWAS; and Fig. S3A), most likely reflecting the fact that GWAS identify genomic regions but not specific coding genes that cause the disease (20). Because indispensable nodes are enriched for causal mutations (Fig. 2 A and B), our resource could help identify causal genes from GWAS.

We also characterized the network controllability in the context of host–parasite interactions, specifically human–virus interactions. Upon infection, viruses control the host cellular network to use the host resources to replicate and to evade the host immune response. Here, we analyzed the node types targeted by human viruses to drive the network from a healthy state to an infectious state. First, we analyzed the targets of HIV, a member of the lentivirus family that causes AIDS. Putative human genes identified to have an effect on HIV-1 replication from large-scale functional genomic screens (data compiled from four RNAi datasets) (21–24) tend to be indispensable nodes (Fig. 2C, RNAi; and Fig. S3C). However, the enrichment was not significant, most likely reflecting the quality of the HIV RNAi screens (25). To analyze direct targets of HIV, we compiled the HIV–human interactome (from recent literature and PPI databases) (26, 27), finding that indispensable nodes are enriched for physical interactions with HIV proteins (Fig. 2C, PPIs; and Fig. S3C). Analysis of 208 different human–virus networks (26–29) reveals that human viruses commonly target indispensable nodes to control the host network (Fig. 2C, Virus targets; and Fig. S3C). We noticed that after adjusting for literature bias indispensable nodes remain enriched as viral targets, whereas adjusting for degree bias shows only weak enrichment (Fig. S3D). This finding is in agreement with the previous observations that viruses tend to target hubs (30).

Finally, we characterized the network controllability in the context of driving the system from disease to healthy state. Specifically, we analyzed the node types that are targeted by drugs/small molecules (Fig. 2D). By analyzing the targets of drugs approved by the Food and Drug Administration (FDA) (31), we found that indispensable nodes are enriched for drug targets (Fig. 2D, FDA targets; and Fig. S3 E and F). Extending the analysis to the list of proteins that are annotated as druggable (32), i.e., proteins with folds that favor interactions with drug-like chemical compounds, showed that the druggable genome list is not significantly enriched for indispensable nodes (Fig. 2D, D I; and Fig. S3E). Interestingly, analyzing the druggable genome list by excluding FDA-approved drug targets showed underrepresentation of indispensable nodes (Fig. 2D, D II; and Fig. S3E). This finding suggests a potential application of our analysis to redefine the druggable genome based on the network controllability.

All of the above analyses of disease mutations, viruses, and drugs consistently showed that indispensable nodes are preferred targets. We also analyzed how often indispensable nodes act as driver nodes by using a recently developed approach to identify the role of each node as a driver in the MDSs (33). We found that 378 nodes appear in all MDSs (i.e., they play roles in all of the control configurations), 3,330 nodes are in some but not all MDSs (i.e., they play roles in some control configurations but the network can still be controlled without directly controlling them), and 2,631 nodes do not belong to any MDS (i.e., they play no roles in control) (Dataset S1) (33). Interestingly, we found that indispensable nodes are never driver nodes in any MDS (Fig. S3G and Dataset S1). This fact can actually be rigorously proven (SI Text). Moreover, perturbing indispensable nodes increases the number of driver nodes needed for control, suggesting that, from a controllability perspective, these nodes are fragile points in the network.

We further analyzed indispensable nodes in specific signaling pathways such as receptor tyrosine kinase (RTK) signaling pathways, which are commonly perturbed in cancer (34). Strikingly, 67 out of 170 RTK pathway members are indispensable nodes (P < 0.0001), including 51 indispensable nodes targeted by disease mutations, viruses, or drugs (Fig. S4A and Dataset S2). Furthermore, we identified 21 indispensable nodes from different signaling pathways that are shared targets of cancer mutations, viruses, as well as drugs (Fig. S4B and Dataset S2).

[Figure 2, panels A–D; image not reproduced. Bar graphs of z scores (axis from −10 to 15) for dispensable, neutral, and indispensable nodes. Panel A: Cancer I, with PubMed- and Degree-controlled random sets. Panel B: Cancer II, OMIM, GWAS. Panel C: HIV targets (RNAi, PPIs) and Virus targets. Panel D: FDA targets and Druggable genome (D I, D II).]

Fig. 2. Characterizing network controllability in transition from healthy to disease state. (A) Bar graph showing the enrichment results (z scores) of cancer genes compared with the random sets (Cancer I, cancer gene census) and the random sets controlled for literature (PubMed) or degree (Degree) bias. In the case of degree- or literature-controlled random sets, the random sets are sampled such that the average degree or average PubMed records of random sets matches the average of node type N. (B) Results from enrichment analysis of dataset corresponding to extended list of cancer genes (Cancer II), other human diseases (OMIM), and GWAS. (C) Results from enrichment analysis of the targets of HIV identified using RNAi screens (RNAi) and PPI networks (PPIs) and targets of other human viruses (208 viruses). (D) Enrichment results from targets of FDA-approved drugs and druggable genome. D I, druggable genome; D II, druggable genome excluding FDA-approved targets.
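The enrichment z scores reported in this section compare an observed overlap against random sets. The sketch below shows one way such a score could be computed for the uncontrolled case. The sampling scheme (same-size random node sets drawn from the whole network), the function name, and the toy data are assumptions made for illustration; the degree- and literature-controlled variants described in SI Text are not reproduced.

```python
import random
import statistics


def enrichment_z(node_set, gene_list, background, n_random=1000, seed=0):
    """z score of |node_set & gene_list| relative to random node sets of
    the same size drawn from `background` (hypothetical helper)."""
    rng = random.Random(seed)
    pool = sorted(background)            # a list, so rng.sample accepts it
    genes = set(gene_list) & set(pool)   # annotated genes present in the network
    nodes = set(node_set) & set(pool)
    observed = len(nodes & genes)
    null = [len(genes.intersection(rng.sample(pool, len(nodes))))
            for _ in range(n_random)]
    mu = statistics.mean(null)
    sigma = statistics.pstdev(null)
    if sigma == 0:
        return float("nan")              # degenerate null distribution
    return (observed - mu) / sigma


# Toy data: 100 nodes, 20 "indispensable", 40 "dispensable", 10 disease genes
background = range(100)
indispensable = set(range(20))
dispensable = set(range(60, 100))
disease_genes = set(range(10))           # all of them fall among the indispensable nodes

print(enrichment_z(indispensable, disease_genes, background))   # positive: enriched
print(enrichment_z(dispensable, disease_genes, background))     # negative: depleted
```

A positive score means the node type contains more annotated genes than random sets of equal size, and a negative score means fewer, which is how the bars in Fig. 2 read.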

Robustness of Indispensable Node Classification. The false-positiveand false-negative interactions are major concerns in PPI net-works, especially the false negatives because the current net-works are vastly incomplete (35). Hence, we systematicallyanalyzed the robustness of node classification with respect toadding or removing interactions. Specifically, we analyzed theindispensable node classification as a function of removing edges(or network filtering). The network filtering is achieved by usinga confidence score assigned to edge directions, where the moststringent filtering resulted in smaller high-confidence–directednetworks (20,151 edges and 5,317 nodes). We analyzed thecontrollability of filtered networks and compared them to theoriginal network. The results show that 90% of the indispensablenodes in the stringent filtered network are indispensable in theoriginal network (Fig. 3A, Fig. S5 A and B, and Dataset S3),suggesting that the indispensable node classification is robustwith respect to adding or removing edges in the network.Next, we analyzed the controllability of networks with per-

turbations (e.g., edge rewiring or edge-direction flipping). In thecase of random rewiring, up to 100% of the edges are rewired(node degrees are preserved), and in the case of direction-flippednetworks, up to 100% of the edge directions are reversed. Weobserved that up to 50% of indispensable nodes in the rewired ordirection-flipped network do not agree with the original annota-tion, showing that indispensability is highly sensitive to the con-nectivity pattern and edge direction (Fig. 3B, Fig. S5 C–F, andDataset S3). Comparing indispensable nodes of the real networkto that of the rewired (100% rewiring) and flipped (40% flipping)networks revealed two subtypes (type-I and type-II) of in-dispensable nodes (Fig. 3C and Dataset S3). If a node’s in-dispensability is robust to rewiring or flipping, then we call it atype-I node; if the node’s indispensability is sensitive to rewiring orflipping, then we refer to them as type-II nodes. We found that57% of indispensable nodes are type-I nodes and 43% are type-II.

The degree distribution of the subtypes shows that type-I nodes tend to be hubs, whereas the average degree of type-II nodes is similar to the average degree of the network (Fig. 3D). Indeed, type-II nodes cannot be distinguished from the rest of the nodes based on any of the other network properties analyzed (Fig. S5G). Furthermore, type-I nodes show literature and annotation bias compared with type-II nodes (Fig. 3 E and F). With respect to diseases, both node types show similar enrichment for cancer genes and other human diseases (Fig. 3G). In contrast to type-I nodes, which tend to be hubs and well-studied genes, type-II nodes are poorly studied and show no special network feature except indispensability, suggesting that control theory brings information orthogonal to traditional network analysis.

Applying Network Controllability Analysis to Mine Cancer Genomic Data. Our finding that indispensable nodes (both type-I and type-II) are more likely to correspond to cancer genes prompted us to systematically survey the perturbation of those genes in cancer. We analyzed data from 1,547 patients obtained from The Cancer Genome Atlas (TCGA) (cancergenome.nih.gov) and the cBioPortal for Cancer Genomics (36), representing nine different cancer types (Dataset S4). Specifically, we analyzed the amplification or deletion of type-II indispensable nodes in the nine cancer types. Note that the copy number alteration (CNA) data are normalized to the expression levels to identify amplifications or deletions that result in expression level changes (SI Text). We ranked all genes by the number of patients in which the gene is amplified or deleted and selected the top 1% as frequently amplified/deleted genes; 56 type-II genes were identified as part of the top 1% of deleted/amplified genes in the nine cancer types (Fig. 4A and Dataset S4). Strikingly, 10 of the 56 type-II genes are known cancer genes, an overlap that is highly significant (P = 0.00002) (Fig. 4B and Fig. S6A). Interestingly, the frequency of deletion and amplification of type-II indispensable nodes is not significantly enriched compared with random sets, an observation similar to that for the Cancer Gene Census gene list (Dataset S4). Furthermore, we compared the type-II genes with results from a cell proliferation screen (37) that identified a subset of genes that regulate cell proliferation (“GO” genes induce proliferation and “STOP” genes suppress proliferation); 17 of the 56 genes are regulators of cell proliferation (11 GO genes and 8 STOP genes, 2 of which are both GO and STOP genes) (Fig. 4C and Fig. S6 B and C). The overlap between type-II genes and GO genes is statistically significant (P = 0.0003).
Of the 56 genes, 10 are frequently perturbed in multiple cancer types [e.g., proteasome 26S subunit, non-ATPase, 4 (PSMD4) in four different cancers], and all of them show similar deletion or amplification profiles (e.g.,

[Fig. 3 image not reproduced. Recoverable panel content: (A) fraction of unchanged indispensable nodes vs. edge score cutoff (0.5–0.9) for filtered networks; (B) fraction of unchanged indispensable nodes vs. fraction of edges randomized (0.0–1.0) for rewired and flipped networks; (C) pie chart of type-I indispensable nodes (robust to edge rewiring and flipping direction; 57.5%) and type-II indispensable nodes (sensitive to network topology), the latter subdivided into nodes sensitive to both edge rewiring and flipping direction, to flipping direction, or to edge rewiring (extracted slice values: 12.2%, 22.6%, and 7.7%); (D) average node degree (in-degree and out-degree); (E) average PubMed records; (F) average GO terms; (G) z score for cancer and OMIM genes. Panels D–G compare type-I and type-II nodes.]

Fig. 3. Perturbation of network connectivity reveals two subtypes of indispensable nodes (type-I and type-II). (A) Fraction of indispensable nodes in filtered networks that overlaps with the real network. Network filtering was achieved using the edge confidence score. (B) Fraction of indispensable nodes in rewired or direction-flipped networks that overlaps with the real network. (C) Identification of type-I and type-II indispensable nodes. The average node degree (D), PubMed record association (E), and Gene Ontology (GO) term annotations (F) for type-I and type-II indispensable nodes. (G) Enrichment of type-I and type-II indispensable nodes as cancer genes and OMIM disease genes.

Vinayagam et al. PNAS | May 3, 2016 | vol. 113 | no. 18 | 4979


PSMD4 amplified in all four cancers) (Fig. 4D). Almost half of the genes (23 genes) are poorly studied, with fewer than 50 associated PubMed records; for instance, small G protein signaling modulator 2 (SGSM2) is associated with only eight PubMed records (Fig. 4D). This contextual evidence, along with their indispensability, suggests that the 46 type-II nodes not previously associated with cancer could be potential cancer genes.

Database of Directed PPI Network with Predicted Controllability. We created the DirectedPPI database (www.flyrnai.org/DirectedPPI) to navigate the directed human PPI network with predicted controllability. Users can enter a gene or upload a list of genes, and the tool generates a network with directed edges connecting the input list. The tool also accepts a gene list with values (e.g., mutation frequency, P values from GWAS, or expression changes). The three node types (indispensable, neutral, and dispensable) are distinguished by node shape and color, and all of the properties analyzed in this article are displayed for these nodes. This tool will be useful for analyzing disease datasets and other high-throughput datasets to identify indispensable nodes and their interconnections.

Discussion

Studying the controllability of a complex biological network is difficult because we typically do not know the true functional form of the underlying dynamics. However, most biological systems operate near homeostasis, so local properties are indeed what we want to ascertain. Here, we showed that applying linear control tools to study the local structural controllability of inherently nonlinear biological networks provides meaningful predictions. Furthermore, we demonstrated that local controllability tools help identify known human disease genes and can be used to identify novel disease genes and drug targets.

Our analysis of the directed human PPI network identifies 36% of the nodes as driver nodes, which is similar to what has been observed in metabolic networks (∼30%) (14). The node classification based on network controllability shows distinct biological properties in the context of essentiality, conservation, and regulation. Specifically, we found that indispensable nodes are well conserved, highly regulated at the translational and PTM levels, and important for the transition between healthy and disease states. Interestingly, this enrichment pattern is partially shared by the nodes in the minimum dominating sets, which occupy strategically important positions for controlling the network (38–40). Furthermore, the identification of indispensable nodes as primary targets of disease-causing mutations, viruses, and drugs revealed a potential application of this framework to identify novel disease genes and potential drug targets.

Interestingly, disease-causing mutations, viruses, and drugs target fragile points (indispensable nodes) that determine the number of driver nodes rather than the driver nodes themselves, suggesting that network controllability is crucial in transitioning between healthy and disease states. Although network topology-based properties such as hubs and modules are commonly used to identify disease genes (41–44), the controllability perspective provides a complementary analysis framework for network medicine. In particular, type-II nodes, which cannot be distinguished by existing network properties and show no publication bias, were still identified by our controllability framework as nodes of special interest. We envision that, in the future, improving the quality and completeness of interactome maps and integrating the dynamics of network components will greatly advance our understanding of biological networks in the context of both biological function and human disease.

Methods

Datasets and Enrichment Analysis. All of the datasets used for the enrichment analysis in this study are listed in Dataset S2. Details on the directed human PPI network, random networks, and datasets used for enrichment analysis can be found in SI Text.

Controllability Analysis and Node Classification. To identify an MDS, with size denoted as N_D, whose control is sufficient to ensure the structural controllability of linear dynamics (14) and the local structural controllability of nonlinear dynamics (SI Text) on any directed weighted network, we map the structural controllability problem in control theory to the maximum matching problem in graph theory, which can be solved in polynomial time (15).

After a node is removed, denote the minimum number of driver nodes of the damaged network as N_D′. We can classify nodes into three categories: (i) a node is indispensable if in its absence we have to control more driver nodes (i.e., N_D′ > N_D); (ii) a node is dispensable if in its absence N_D′ < N_D; and (iii) a node is neutral if in its absence N_D′ = N_D. Note that indispensable nodes are never driver nodes in any control configuration or MDS, which can be proven by contradiction (SI Text). More information on node classification and local structural controllability can be found in SI Text.

Random Networks. To compare the real network with its randomized counterparts, we performed two types of randomization: (i) edge rewiring:

[Fig. 4 image not reproduced. Recoverable panel content: (A) bar plot of the number of type-II genes amplified or deleted in each cancer type (BRCA, COAD, GBM, KIRC, LGG, LUAD, LUSC, OV, and UCEC); (B) Venn diagram of type-II nodes frequently deleted or amplified vs. Cancer Gene Census genes (46 type-II only, 10 shared, 435 census only; significant overlap, P value 0.00002); (C) Venn diagrams of type-II nodes frequently deleted or amplified vs. regulators of cell proliferation (GO genes and STOP genes), with one overlap marked significant (P value 0.0003) and one marked not significant; (D) network connecting the nine cancer types to the 56 type-II genes, with a legend separating type-II nodes that are part of cancer genes or identified as regulators of cell proliferation from type-II nodes that are not part of cancer genes. Gene labels in panel D: TUBG1, F11R, SMARCA2, AURKA, PPP1R16A, H3F3A, TRIP13, ETF1, NUP98, RNF4, PTEN, AP3B1, CCT5, ERBB2IP, BLMH, NPM1, RNF19A, RAD23B, L3MBTL2, AKAP10, GLI3, DOCK1, MED10, BTF3, ITCH, CSE1L, CCT3, MARK3, CTDP1, PSMC1, CDKN2A, ATR, PIK3CA, FLII, PPP2CB, PAFAH1B2, GNA11, AP2B1, HERPUD1, TAF1C, POLR2C, ENAH, STK11, PRPF6, ZNF24, SOCS6, RALBP1, SGSM2, UBAC2, CHMP2A, NR2F6, ING4, RPL18A, UBE2D2, PSMD4, and PIP5K1A.]

Fig. 4. Applying network controllability to mine cancer genomic data. (A) Type-II genes frequently amplified or deleted in cancer patients (part of the top 1% of genes). The bar plot shows the number of type-II genes deleted/amplified in brain lower grade glioma (LGG), kidney renal clear cell carcinoma (KIRC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), ovarian serous cystadenocarcinoma (OV), uterine corpus endometrial carcinoma (UCEC), breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), and glioblastoma multiforme (GBM) cancers. (B) Overlap between frequently deleted/amplified type-II genes and known cancer genes. (C) Overlap between frequently deleted/amplified type-II genes and regulators of cell proliferation (STOP genes reduce cell proliferation and GO genes increase cell proliferation). The P values show the significance of overlap calculated based on 1,000 random sets. (D) Network representation of the 56 type-II genes frequently deleted (red edge) or amplified (blue edge) in nine different cancer types. The node size corresponds to the number of PubMed records associated with the gene.

4980 | www.pnas.org/cgi/doi/10.1073/pnas.1603992113 Vinayagam et al.


we randomly choose a fraction p of edges to rewire, using the degree-preserving random rewiring algorithm (45); and (ii) edge flipping: we randomly choose a fraction p of edges and flip their directions. We tune p from 0 to 1, resulting in a series of randomized networks.
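The two randomization schemes can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: edges are (source, target) tuples without self-loops or duplicates, a fraction p of rewiring is approximated by round(p·E/2) successful target swaps (each swap rewires two edges, and an edge may be picked more than once), and the function names, `max_tries`, and the toy graph are hypothetical.

```python
import random
from collections import Counter


def flip_edges(edges, p, rng):
    """Reverse the direction of a randomly chosen fraction p of the edges."""
    edges = list(edges)
    chosen = set(rng.sample(range(len(edges)), round(p * len(edges))))
    return [(v, u) if i in chosen else (u, v) for i, (u, v) in enumerate(edges)]


def rewire_edges(edges, p, rng, max_tries=100):
    """Degree-preserving rewiring by swapping the targets of edge pairs.

    A successful swap turns (a->b, c->d) into (a->d, c->b), so every node
    keeps its in-degree and out-degree.
    """
    edges = list(edges)
    present = set(edges)
    n_swaps = round(p * len(edges) / 2)   # assumption: one swap rewires two edges
    for _ in range(n_swaps):
        for _ in range(max_tries):
            i, j = rng.sample(range(len(edges)), 2)
            (a, b), (c, d) = edges[i], edges[j]
            # Reject swaps that would create a self-loop or a duplicate edge.
            if a == d or c == b or (a, d) in present or (c, b) in present:
                continue
            present -= {(a, b), (c, d)}
            present |= {(a, d), (c, b)}
            edges[i], edges[j] = (a, d), (c, b)
            break
    return edges


# Toy directed graph on 10 nodes (20 edges).
rng = random.Random(0)
edges = [(i, (i + 1) % 10) for i in range(10)] + [(i, (i + 3) % 10) for i in range(10)]
flipped = flip_edges(edges, 0.5, rng)
rewired = rewire_edges(edges, 1.0, rng)
print(Counter(u for u, _ in rewired) == Counter(u for u, _ in edges))
```

Flipping changes in- and out-degrees but keeps the undirected skeleton, whereas rewiring keeps both degree sequences and changes who is connected to whom; that difference is why the two perturbations probe different aspects of indispensability.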

Analysis of Cancer Genomic Datasets. Copy number alteration data for nine cancer types were downloaded from the cBioPortal for Cancer Genomics (www.cbioportal.org). Gene expression data for each cancer type were downloaded from TCGA (https://tcga-data.nci.nih.gov/tcga). Next, we filtered for patients with both CNA and expression data available (details are available in Dataset S4). We computed a z score for each gene in a patient to identify whether the amplification or deletion results in an expression change for the corresponding gene. A gene is defined as amplified if the Genomic Identification of Significant Targets in Cancer (GISTIC) score (46) is ≥1 and the z score is ≥1.5 and as deleted if the GISTIC score is ≤−1 and the z score is ≤−1.5 (SI Text).

ACKNOWLEDGMENTS. We thank M. W. Kirschner, S. E. Mohr, I. T. Flockhart, S. Rajagopal, and Bingbo Wang for helpful suggestions. This work was supported by National Institutes of Health (NIH) Grants P01-CA120964, Centers of Excellence of Genomic Science (CEGS) 1P50HG004233, and 1R01HL118455-01A1 and John Templeton Foundation Awards PFI-777 and 51977. N.P. is supported by the Howard Hughes Medical Institute.

1. Isidori A (1995) Nonlinear Control Systems (Springer, Berlin, New York), 3rd Ed.

2. Kalman RE (1963) Mathematical description of linear dynamical systems. J Soc Indust Appl Math Ser A Control 1(2):152–192.

3. Slotine JJE, Li W (1991) Applied Nonlinear Control (Prentice Hall, Englewood Cliffs, NJ).

4. Iglesias PA, Ingalls BP (2010) Control Theory and Systems Biology (MIT Press, Cambridge, MA).

5. Del Vecchio D, Murray RM (2015) Biomolecular Feedback Systems (Princeton Univ Press, Princeton).

6. Balázsi G, van Oudenaarden A, Collins JJ (2011) Cellular decision making and biological noise: From microbes to mammals. Cell 144(6):910–925.

7. Luenberger DG (1979) Introduction to Dynamic Systems: Theory, Models, and Applications (Wiley, New York).

8. Csete ME, Doyle JC (2002) Reverse engineering of biological complexity. Science 295(5560):1664–1669.

9. Mangan S, Alon U (2003) Structure and function of the feed-forward loop network motif. Proc Natl Acad Sci USA 100(21):11980–11985.

10. Callura JM, Cantor CR, Collins JJ (2012) Genetic switchboard for synthetic biology applications. Proc Natl Acad Sci USA 109(15):5850–5855.

11. Nepusz T, Vicsek T (2012) Controlling edge dynamics in complex networks. Nat Phys 8(7):568–573.

12. Kiel C, Yus E, Serrano L (2010) Engineering signal transduction pathways. Cell 140(1):33–47.

13. Vinayagam A, et al. (2011) A directed protein interaction network for investigating intracellular signal transduction. Sci Signal 4(189):rs8.

14. Liu YY, Slotine JJ, Barabási AL (2011) Controllability of complex networks. Nature 473(7346):167–173.

15. Hopcroft JE, Karp RM (1973) An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM J Comput 2:225–231.

16. Ashburner M, et al.; The Gene Ontology Consortium (2000) Gene ontology: Tool for the unification of biology. Nat Genet 25(1):25–29.

17. Futreal PA, et al. (2004) A census of human cancer genes. Nat Rev Cancer 4(3):177–183.

18. Higgins ME, Claremont M, Major JE, Sander C, Lash AE (2007) CancerGenes: A gene selection resource for cancer genome projects. Nucleic Acids Res 35(Database issue):D721–D726.

19. Hindorff LA, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106(23):9362–9367.

20. McCarthy MI, Hirschhorn JN (2008) Genome-wide association studies: Potential next steps on a genetic journey. Hum Mol Genet 17(R2):R156–R165.

21. Brass AL, et al. (2008) Identification of host proteins required for HIV infection through a functional genomic screen. Science 319(5865):921–926.

22. König R, et al. (2008) Global analysis of host-pathogen interactions that regulate early-stage HIV-1 replication. Cell 135(1):49–60.

23. Yeung ML, Houzet L, Yedavalli VS, Jeang KT (2009) A genome-wide short hairpin RNA screening of jurkat T-cells for human proteins contributing to productive HIV-1 replication. J Biol Chem 284(29):19463–19473.

24. Zhou H, et al. (2008) Genome-scale RNAi screen for host factors required for HIV replication. Cell Host Microbe 4(5):495–504.

25. Goff SP (2008) Knockdown screens to knockout HIV-1. Cell 135(3):417–420.

26. Jäger S, et al. (2011) Global landscape of HIV-human protein complexes. Nature 481(7381):365–370.

27. Zanzoni A, et al. (2002) MINT: A Molecular INTeraction database. FEBS Lett 513(1):135–140.

28. Navratil V, et al. (2009) VirHostNet: A knowledge base for the management and the analysis of proteome-wide virus-host interaction networks. Nucleic Acids Res 37(Database issue):D661–D668.

29. Rozenblatt-Rosen O, et al. (2012) Interpreting cancer genomes using systematic host network perturbations by tumour virus proteins. Nature 487(7408):491–495.

30. Calderwood MA, et al. (2007) Epstein-Barr virus and virus human protein interaction maps. Proc Natl Acad Sci USA 104(18):7606–7611.

31. Knox C, et al. (2011) DrugBank 3.0: A comprehensive resource for ‘omics’ research on drugs. Nucleic Acids Res 39(Database issue):D1035–D1041.

32. Hopkins AL, Groom CR (2002) The druggable genome. Nat Rev Drug Discov 1(9):727–730.

33. Jia T, et al. (2013) Emergence of bimodality in controlling complex networks. Nat Commun 4:2002.

34. Blume-Jensen P, Hunter T (2001) Oncogenic kinase signalling. Nature 411(6835):355–365.

35. Venkatesan K, et al. (2009) An empirical framework for binary interactome mapping. Nat Methods 6(1):83–90.

36. Gao J, et al. (2013) Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal 6(269):pl1.

37. Solimini NL, et al. (2012) Recurrent hemizygous deletions in cancers may optimize proliferative potential. Science 337(6090):104–109.

38. Wuchty S (2014) Controllability in protein interaction networks. Proc Natl Acad Sci USA 111(19):7156–7160.

39. Nacher JC, Akutsu T (2012) Dominating scale-free networks with variable scaling exponent: Heterogeneous networks are not difficult to control. New J Phys 14(7):073005.

40. Nacher JC, Akutsu T (2014) Analysis of critical and redundant nodes in controlling directed and undirected complex networks using dominating sets. J Complex Netw 2(4):394–412.

41. Rambaldi D, Giorgi FM, Capuani F, Ciliberto A, Ciccarelli FD (2008) Low duplicability and network fragility of cancer genes. Trends Genet 24(9):427–430.

42. Barabási AL, Gulbahce N, Loscalzo J (2011) Network medicine: A network-based approach to human disease. Nat Rev Genet 12(1):56–68.

43. Taylor IW, et al. (2009) Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol 27(2):199–204.

44. Jin Y, et al. (2012) A systems approach identifies HIPK2 as a key regulator of kidney fibrosis. Nat Med 18(4):580–588.

45. Maslov S, Sneppen K (2002) Specificity and stability in topology of protein networks. Science 296(5569):910–913.

46. Beroukhim R, et al. (2007) Assessing the significance of chromosomal aberrations in cancer: Methodology and application to glioma. Proc Natl Acad Sci USA 104(50):20007–20012.

47. Sontag ED (1998) Mathematical Control Theory: Deterministic Finite Dimensional Systems (Springer, New York), 2nd Ed.

48. Sastry S (1999) Nonlinear System: Analysis, Stability, and Control (Springer, New York).

49. Lin C-T (1974) Structural controllability. IEEE Trans Automat Contr 19(3):201–208.

50. Luo H, Lin Y, Gao F, Zhang CT, Zhang R (2014) DEG 10, an update of the database of essential genes that includes both protein-coding genes and noncoding genomic elements. Nucleic Acids Res 42(Database issue):D574–D580.

51. Chen WH, Minguez P, Lercher MJ, Bork P (2012) OGEE: An online gene essentiality database. Nucleic Acids Res 40(Database issue):D901–D906.

52. Hu Y, et al. (2011) An integrative approach to ortholog prediction for disease-focused and other functional studies. BMC Bioinformatics 12:357.

53. Hornbeck PV, et al. (2012) PhosphoSitePlus: A comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res 40(Database issue):D261–D270.

54. Ben-Shlomo I, Yu Hsu S, Rauch R, Kowalski HW, Hsueh AJ (2003) Signaling receptome: A genomic and evolutionary perspective of plasma membrane receptors involved in signal transduction. Sci STKE 2003(187):RE9.

55. Park J, et al. (2005) Building a human kinase gene repository: Bioinformatics, molecular cloning, and functional validation. Proc Natl Acad Sci USA 102(23):8114–8119.

56. Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S (2002) The protein kinase complement of the human genome. Science 298(5600):1912–1934.

57. Messina DN, Glasscock J, Gish W, Lovett M (2004) An ORFeome-based analysis of human transcription factor genes and the construction of a microarray to interrogate their expression. Genome Res 14(10B):2041–2047.

58. Beck M, et al. (2011) The quantitative proteome of a human cell line. Mol Syst Biol 7:549.

59. Woodsmith J, Kamburov A, Stelzl U (2013) Dual coordination of post translational modifications in human protein networks. PLOS Comput Biol 9(3):e1002933.

60. Fazekas D, et al. (2013) SignaLink 2 - a signaling pathway resource with multi-layered regulatory networks. BMC Syst Biol 7:7.

61. Shannon P, et al. (2003) Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504.

62. Doncheva NT, Assenov Y, Domingues FS, Albrecht M (2012) Topological analysis and interactive visualization of biological networks and protein structures. Nat Protoc 7(4):670–685.


Supporting Information

Vinayagam et al. 10.1073/pnas.1603992113

SI Text

Directed Human PPI Network. The directed human PPI network was compiled from our previous study (13). Briefly, a naïve Bayesian classifier was applied to predict the potential direction of signal flow between the ith and jth interacting proteins p_i and p_j as p_i → p_j, p_j → p_i, or both. The classifier uses features derived from the shortest PPI paths between membrane receptors and transcription factors and assigns a confidence score, ranging from 0.5 to 1, to each predicted edge direction. The weighted and directed edges are then encoded in an N × N matrix A, the weighted adjacency matrix of the directed graph for the PPI network. The element of A in the ith row and jth column is denoted a_ij and is defined as follows: a_ij is in the range [0.5, 1] if there is signal flow from protein p_j to p_i; otherwise a_ij = 0.
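To make the indexing convention concrete, the following minimal sketch builds A from a scored edge list. It assumes edges are given as (source, target, confidence) triples; the function name, the protein names, and the scores are hypothetical and serve only to illustrate that a_ij stores flow from p_j to p_i.

```python
def build_adjacency(nodes, scored_edges):
    """Return the N x N weighted adjacency matrix A as a list of rows.

    scored_edges: iterable of (source, target, confidence) triples with
    confidence in [0.5, 1]. Following the text, A[i][j] holds the weight
    of signal flow from protein j to protein i and is 0 otherwise.
    """
    index = {name: k for k, name in enumerate(nodes)}
    n = len(nodes)
    A = [[0.0] * n for _ in range(n)]
    for source, target, confidence in scored_edges:
        if not 0.5 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        A[index[target]][index[source]] = confidence   # row = target, column = source
    return A


# Hypothetical proteins and confidence scores.
nodes = ["P1", "P2", "P3"]
scored_edges = [("P1", "P2", 0.9), ("P2", "P3", 0.6), ("P3", "P2", 0.7)]
A = build_adjacency(nodes, scored_edges)
for row in A:
    print(row)
```

An interaction predicted in both directions simply fills both a_ij and a_ji, as for P2 and P3 here.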

Controllability Analysis and Node Classification. Recently, we developed a mathematical framework and analytical tools to identify an MDS, with size denoted as N_D, whose control is sufficient to ensure the structural controllability of linear dynamics (14) and the local structural controllability of nonlinear dynamics (SI Text, Local Structural Controllability) on any directed weighted network. This is achieved by mapping the structural controllability problem in control theory to the maximum matching problem in graph theory, which can be solved in polynomial time (15). Here, an edge subset M in a directed network or digraph is called a “matching” if no two edges in M share a common starting node or a common ending node. A node is “matched” if it is an ending node of an edge in the matching. Otherwise, it is “unmatched.” A matching of maximum cardinality is called a “maximum matching.” (In general, there could be many different maximum matchings for a given digraph.) We proved that the unmatched nodes that correspond to any maximum matching can be chosen as driver nodes to control the whole network. Identifying a minimum set of driver nodes is equivalent to choosing an input matrix (often denoted as B) with the minimum number of columns (see SI Text, Local Structural Controllability and ref. 14 for more details). The detailed construction of the input matrix B is not necessary for the identification of driver nodes; it is mentioned only to connect the notion of a driver node to the theoretical discussion in SI Text, Local Structural Controllability.

After a node is removed, denote the minimum number of driver nodes of the damaged network as N_D′. In this work, we classified nodes into three categories. (i) A node is indispensable if in its absence we have to control more driver nodes (i.e., N_D′ > N_D). For example, removal of any node in the middle of a directed path will break the path and cause N_D to increase. Hence, all but the start and end nodes of a directed path are indispensable. (ii) A node is dispensable if in its absence N_D′ < N_D. For example, removal of one leaf node in a star will decrease N_D by 1. (iii) A node is neutral if in its absence N_D′ = N_D. For example, removal of the central hub in a star will not change N_D at all.

Note that a driver node in any MDS can never be an indispensable node. This can be proven by contradiction. Assume a driver node i is indispensable. According to the minimum input theorem (14), driver nodes are just unmatched nodes with respect to a particular maximum matching. There are two cases. Case 1: the driver node i has no downstream neighbors [i.e., k_out(i) = 0]; then in its absence, N_D′ = N_D − 1. Case 2: the driver node i has at least one downstream neighbor [i.e., k_out(i) > 0]. There are two subcases. Case 2.1: if, in the maximum matching, one of node i’s downstream neighbors (node j) is matched by node i, then in the absence of node i, node j will become unmatched (i.e., a new driver node), rendering N_D′ = N_D. Case 2.2: if none of node i’s downstream neighbors is matched by node i, then in the absence of node i, N_D′ = N_D − 1. In none of these cases do we have N_D′ > N_D, which contradicts the definition of indispensable nodes. Hence, driver nodes cannot be indispensable.
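The matching-based definition and the path and star examples above can be checked with a short sketch. It is illustrative rather than the authors' implementation: it uses a simple recursive augmenting-path search instead of the Hopcroft–Karp algorithm cited in the text (adequate for toy graphs only), takes N_D = max(N − |M|, 1) as in ref. 14, and uses hypothetical function and node names.

```python
def driver_node_count(nodes, edges):
    """Minimum number of driver nodes N_D from a maximum matching.

    Each edge u->v links the "out" copy of u to the "in" copy of v in a
    bipartite graph. Nodes whose "in" copy is unmatched are driver nodes;
    a perfectly matched network still needs one driver node.
    """
    nodes = list(nodes)
    succ = {u: [] for u in nodes}
    for u, v in edges:
        succ[u].append(v)
    matched_by = {}   # ending node -> starting node of its matching edge

    def augment(u, seen):
        # Try to match starting node u, re-routing earlier matches if needed.
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in matched_by or augment(matched_by[v], seen):
                matched_by[v] = u
                return True
        return False

    matching_size = sum(augment(u, set()) for u in nodes)
    return max(len(nodes) - matching_size, 1)


def classify_nodes(nodes, edges):
    """Label each node by how its removal changes N_D."""
    nodes = list(nodes)
    n_d = driver_node_count(nodes, edges)
    labels = {}
    for x in nodes:
        rest = [n for n in nodes if n != x]
        kept = [(u, v) for u, v in edges if x not in (u, v)]
        n_d_removed = driver_node_count(rest, kept)
        if n_d_removed > n_d:
            labels[x] = "indispensable"
        elif n_d_removed < n_d:
            labels[x] = "dispensable"
        else:
            labels[x] = "neutral"
    return labels


# Directed path a->b->c->d and a star with three leaves.
path = classify_nodes("abcd", [("a", "b"), ("b", "c"), ("c", "d")])
star = classify_nodes(["hub", "l1", "l2", "l3"],
                      [("hub", "l1"), ("hub", "l2"), ("hub", "l3")])
print(path)
print(star)
```

On these toy graphs the interior path nodes come out indispensable, the path ends and the star hub neutral, and the star leaves dispensable, in line with the examples in the text. Recomputing the matching from scratch for every removed node is the simplest approach; it would be slow on a network with thousands of nodes.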

Enrichment Analysis. To estimate the significance of the overlap between a given node type S and a given dataset D, we compute an enrichment z score as

$$z\ \text{score} = \frac{S_D - \operatorname{mean}(R_D)}{\operatorname{SD}(R_D)},$$

where $S_D$ is the number of proteins from dataset D overlapping with node type S, and $R_D$ is the number of proteins from dataset D overlapping with a random set of proteins of the same size as S. The mean and SD (standard deviation) of $R_D$ are computed from 1,000 simulations of random sets. Note that the entire network with 6,339 proteins is used as the background for random sampling. In addition to the z score, we also computed a two-tailed P value by comparing $S_D$ with the $R_D$ distribution (modeled as a Gaussian distribution). In the case of degree- or literature-controlled random sets, the random sets are sampled such that the average degree or average number of PubMed records of the random sets matches the average of node type S.
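The enrichment computation can be sketched as below. This is a minimal illustration, not the authors' script: it assumes uniform sampling from the background (no degree or literature control), uses the sample standard deviation, derives the two-tailed Gaussian P value from the z score, and uses hypothetical names and toy integer "proteins".

```python
import math
import random
import statistics


def enrichment(node_type, dataset, background, n_sim=1000, seed=0):
    """Return (z score, two-tailed P value) for the overlap of S with D.

    Random sets of size |S| are drawn from the background; their overlaps
    with D form the null distribution R_D.
    """
    rng = random.Random(seed)
    node_type, dataset = set(node_type), set(dataset)
    background = sorted(background)
    s_d = len(node_type & dataset)
    r_d = [len(dataset.intersection(rng.sample(background, len(node_type))))
           for _ in range(n_sim)]
    z = (s_d - statistics.mean(r_d)) / statistics.stdev(r_d)
    p = math.erfc(abs(z) / math.sqrt(2))   # two-tailed P value under a Gaussian null
    return z, p


# Toy background of 1,000 "proteins"; the dataset is the first 100 of them.
background = range(1000)
dataset = range(100)
z_hit, p_hit = enrichment(range(50), dataset, background)               # fully inside D
z_null, p_null = enrichment(range(0, 1000, 20), dataset, background)    # evenly spread
print(round(z_hit, 1), round(z_null, 1))
```

A node set lying entirely inside the dataset gives a large positive z score, whereas an evenly spread set overlaps the dataset about as often as random sets do and gives a z score near zero.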

Datasets Used for Enrichment Analysis. All of the datasets used for the enrichment analysis in this study are listed in Dataset S2, which includes the source of the data, the reference, the number of proteins compiled, and the overlap with the human directed PPI network. The datasets were downloaded from the respective databases or publications as indicated in Dataset S2. The gene or protein IDs from the various resources were mapped to Entrez gene IDs. All compiled datasets are available as an integrated table that shows the nodes and their overlap with the respective datasets (Dataset S1).

Analysis of Cancer Genomic Datasets. Copy number alteration data for nine cancer types were downloaded from the cBioPortal for Cancer Genomics (version corresponds to April 2013; www.cbioportal.org). Using the GISTIC algorithm (46), the cBioPortal provides putative values of copy number alterations for each cancer patient. The GISTIC scores −2, −1, 0, 1, and 2 correspond to deep loss (possibly a homozygous deletion), single-copy loss (heterozygous deletion), diploid, low-level gain, and high-level amplification, respectively. The gene expression data for each cancer type were downloaded from the TCGA (version corresponds to April 2013; https://tcga-data.nci.nih.gov/tcga). Tumor-matched datasets (those in which each participant has been analyzed and compared with normal tissue at the CNA and gene expression levels) were used in the analysis. Level 3 TCGA data (expression calls for genes, per sample) were used in our study. The TCGA data were downloaded using the TCGA web interface with filters set as “Data Type: Expression-Genes”; “Data Level: Level 3”; “Tumor/Normal: Tumor-matched.”

Next, we filtered for patients with both CNA and expression data available (details are available in Dataset S4). We computed a z score for each gene in a patient to identify whether the amplification or deletion results in an expression change for the corresponding gene. Briefly, for each gene the diploid mean and SD of expression values were calculated using the data from patients without any copy number alteration (GISTIC score, 0;

Vinayagam et al. www.pnas.org/cgi/content/short/1603992113 1 of 8


diploid). Using the diploid mean and SD, we computed a z score for each gene in a given patient. A gene is defined as amplified if the GISTIC score is ≥1 and the z score is ≥1.5 and as deleted if the GISTIC score is ≤−1 and the z score is ≤−1.5. All of the data preprocessing and normalization were performed using Perl and Java scripts developed in house.
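The calling rule can be sketched for a single gene as follows. The authors used in-house Perl and Java scripts, so this Python version is only an illustration under stated assumptions: per-patient GISTIC scores and expression values arrive as parallel lists, the sample standard deviation is used for the diploid reference, and the function names and numbers are hypothetical.

```python
import statistics


def call_alteration(gistic, expression, diploid_mean, diploid_sd):
    """Classify one gene in one patient as amplified, deleted, or unchanged."""
    z = (expression - diploid_mean) / diploid_sd
    if gistic >= 1 and z >= 1.5:
        return "amplified"
    if gistic <= -1 and z <= -1.5:
        return "deleted"
    return "unchanged"


def call_gene(gistic_scores, expression_values):
    """Call every patient for one gene.

    The diploid reference (mean and SD of expression) is computed from
    the patients whose GISTIC score is 0.
    """
    diploid = [e for g, e in zip(gistic_scores, expression_values) if g == 0]
    mean, sd = statistics.mean(diploid), statistics.stdev(diploid)
    return [call_alteration(g, e, mean, sd)
            for g, e in zip(gistic_scores, expression_values)]


# Made-up data for one gene across seven patients.
gistic = [0, 0, 0, 0, 2, -1, 1]
expr = [9.0, 10.0, 11.0, 10.0, 14.0, 7.0, 10.5]
calls = call_gene(gistic, expr)
print(calls)
```

The last patient has a copy number gain (GISTIC score 1) but an expression z score below 1.5, so the gene is not called amplified there; this is the expression normalization step that removes copy number changes without an expression effect.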

Local Structural Controllability. A dynamic system is controllableif, with a suitable choice of inputs, it can be driven from anyinitial state to any final state in finite time (2). Most complexbiological systems are characterized by nonlinear interactionsbetween the components, and often only local properties canbe verified. Similarly, it is often easier to obtain local analyticalresults for controllability of nonlinear systems. Here, we reviewa sufficient condition for “local controllability” of a nonlinearsystem about a trim point. A system is “locally controllable” ifthere exists a neighborhood in the state space such that allinitial conditions in that neighborhood are controllable to allother elements in the neighborhood with locally boundedtrajectories (47). This definition of controllability can beverified by checking the well-known “Kalman rank condition”used in the controllability analysis of linear systems. The restof the section is tutorial in fashion so as to illustrate how theadjacency matrix can be used to analyze the local controlla-bility of a PPI network.Consider a dynamic system governed by a set of ordinary

differential equations

dxdt

= f ðxðtÞÞ,

where x= ½x1   x2⋯  xn�T is the state vector and t is time. We areinterested in determining an n×m matrix B such that the con-trolled system

dxdt

= f ðxðtÞÞ+Bu [1]

is locally controllable through the input $u = [u_1\ u_2\ \cdots\ u_m]^T$. Let $z^*$ be defined by $f(z^*) = 0$, and let $A(z^*) = \frac{\partial f}{\partial x}(z^*)$ and $G(z^*) = [B\ \ AB\ \ \cdots\ \ A^{n-1}B]$. The matrix $G(z^*)$ is referred to as the Kalman controllability matrix. The dynamics in Eq. 1 are locally controllable around $z^*$ if $G(z^*)$ has rank $n$ (Theorem 7 in ref. 47; Proposition 11.2 in ref. 48). The local controllability analysis of Eq. 1 about a trim point therefore reduces to the classic Kalman controllability analysis (2) of the linear dynamics

$$\frac{d(z(t) - z^*)}{dt} = A\,(z(t) - z^*) + Bu. \tag{2}$$
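As an illustration of the Kalman rank condition, the following Python sketch builds the controllability matrix [B, AB, ..., A^(n−1)B] for two three-node toy systems and checks whether its rank equals n. The matrices, the function names, and the convention that entry A[i][j] is nonzero when node j acts on node i are assumptions chosen for the example; they are not taken from the PPI network.

```python
from fractions import Fraction


def mat_mul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def rank(M):
    """Matrix rank by Gaussian elimination over exact fractions."""
    M = [[Fraction(v) for v in row] for row in M]
    r = 0
    for c in range(len(M[0])):
        pivot = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            factor = M[i][c] / M[r][c]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r


def kalman_matrix(A, B):
    """G = [B, AB, ..., A^(n-1) B], with the blocks placed side by side."""
    n = len(A)
    blocks, block = [], B
    for _ in range(n):
        blocks.append(block)
        block = mat_mul(A, block)
    return [sum((blk[i] for blk in blocks), []) for i in range(n)]


def is_controllable(A, B):
    """Kalman rank condition: rank(G) == n."""
    return rank(kalman_matrix(A, B)) == len(A)


# Toy chain x1 -> x2 -> x3 with a single input on x1.
chain = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
# Toy star: x1 acts on both x2 and x3, single input on x1.
star = [[0, 0, 0], [1, 0, 0], [1, 0, 0]]
B = [[1], [0], [0]]

print(is_controllable(chain, B), is_controllable(star, B))
```

With these particular entries the chain passes the rank test while the star does not, because the two leaves of the star receive the same signal and cannot be steered independently from one input.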

Recall that the dynamics in Eq. 2 are deemed “structurally controllable” if there exists a controllable pair $(A_0, B_0)$ with the same structure as the pair $(A, B)$ (49). That is, we are not concerned about the particular values in $(A, B)$, just the pattern of the nonzero entries in $(A, B)$. The dynamics in Eq. 1 are deemed “locally structurally controllable” if the linearized dynamics in Eq. 2 are structurally controllable.

For the purposes of this work, the adjacency matrix of the experimentally determined PPI network is used to find the structure of $A$ in Eq. 2; the nodes are then classified based upon their impact on the structural controllability (Methods).
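The node classification step can be sketched in Python under two stated assumptions. First, the minimum number of driver nodes is taken to be max(N − |M*|, 1), where M* is a maximum matching of the directed network; this is the standard result in the structural controllability literature, whereas this section defers the actual procedure to Methods. Second, a node is labeled indispensable, neutral, or dispensable when its removal increases, leaves unchanged, or decreases the number of driver nodes, as defined in the main text. Function names and the toy networks are hypothetical.

```python
def max_matching_size(nodes, edges):
    """Maximum matching between out-copies and in-copies of the nodes."""
    succ = {u: [] for u in nodes}
    for u, v in edges:
        succ[u].append(v)
    matched_by = {}  # in-copy v -> out-copy u currently matched to it

    def augment(u, seen):
        # Try to match u, re-routing earlier matches along an augmenting path.
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in matched_by or augment(matched_by[v], seen):
                matched_by[v] = u
                return True
        return False

    return sum(augment(u, set()) for u in nodes)


def n_driver(nodes, edges):
    """Minimum number of driver nodes: max(N - |M*|, 1)."""
    return max(len(nodes) - max_matching_size(nodes, edges), 1)


def classify(node, nodes, edges):
    """Compare the driver-node count before and after removing one node."""
    rest = [x for x in nodes if x != node]
    kept = [(u, v) for u, v in edges if node not in (u, v)]
    before, after = n_driver(nodes, edges), n_driver(rest, kept)
    if after > before:
        return "indispensable"
    if after < before:
        return "dispensable"
    return "neutral"


chain_nodes, chain_edges = ["a", "b", "c"], [("a", "b"), ("b", "c")]
star_nodes, star_edges = ["h", "x", "y"], [("h", "x"), ("h", "y")]

print(classify("b", chain_nodes, chain_edges))  # middle of the chain
print(classify("x", star_nodes, star_edges))    # one leaf of the star
```

The recursive augmenting-path search is adequate for toy graphs; a network with thousands of nodes would call for an iterative matching algorithm such as Hopcroft-Karp.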

Vinayagam et al. www.pnas.org/cgi/content/short/1603992113 2 of 8


[Figure S1 graphic omitted. Panels A and B: bar plots of the average number of PubMed records (A) and Gene Ontology terms (B) for dispensable, neutral, and indispensable nodes. Panels C and D: PubMed records vs. in-degree (C; correlation coefficient = 0.37) and out-degree (D; correlation coefficient = 0.41). Panel E: density plots of the number of essential genes for each node type. Panel F: density plots of the number of genes conserved in mouse, fish, fly, worm, and yeast for each node type.]

Fig. S1. (A and B) Literature and annotation bias for the three node types. Bar plots show average PubMed records associated (A) and Gene Ontology terms annotated (B) for each node type. (C and D) Correlation of node degree vs. literature bias. The plots show the correlation of in-degree (C) and out-degree (D) to the number of PubMed records associated with each node in the entire network. (E) Enrichment analysis of essential genes. Numbers of essential genes overlapping with dispensable, neutral, and indispensable nodes are shown in red arrows. The essential genes are compiled from the Database of Essential Genes (DEG) (tubic.tju.edu.cn/deg) (50) and Online GEne Essentiality database (OGEE) (ogeedb.embl.de) (51). Numbers of essential genes overlapping with size-controlled random sets are shown in gray bars. (F) Enrichment analysis of conserved genes. Numbers of genes conserved in Mus musculus (mouse), Danio rerio (fish), Drosophila melanogaster (fly), Caenorhabditis elegans (worm), and Saccharomyces cerevisiae (yeast) are shown in red arrows, and their respective size-controlled random set distributions are shown in gray bars. The ortholog mapping was performed using the Drosophila RNAi Screening Center (DRSC) Integrative Ortholog Prediction Tool (DIOPT) (www.flyrnai.org/DIOPT) (52).
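The enrichment analysis against size-controlled random sets can be sketched as follows. This is a minimal Python illustration, assuming that each random set is drawn uniformly from all network nodes with the same size as the node set under test, that the z score uses the mean and SD of the random overlaps, and that the P value is a one-sided empirical estimate; the number of random sets, the seed, and all names are hypothetical.

```python
import random
from statistics import mean, pstdev


def enrichment(node_set, annotated, universe, n_random=1000, seed=0):
    """Compare an observed overlap with overlaps of same-size random sets."""
    rng = random.Random(seed)
    universe = sorted(universe)
    annotated = set(annotated)
    observed = len(set(node_set) & annotated)
    random_overlaps = [
        sum(1 for g in rng.sample(universe, len(node_set)) if g in annotated)
        for _ in range(n_random)
    ]
    mu, sd = mean(random_overlaps), pstdev(random_overlaps)
    z = (observed - mu) / sd if sd > 0 else float("nan")
    # Assumption: one-sided empirical P value with the usual +1 correction.
    p = (1 + sum(o >= observed for o in random_overlaps)) / (n_random + 1)
    return observed, mu, sd, z, p


# Toy data: 100 nodes, 20 of them annotated, and a 10-node set of interest.
universe = range(100)
annotated = set(range(20))
node_set = set(range(10))
observed, mu, sd, z, p = enrichment(node_set, annotated, universe)
print(observed, round(mu, 2), round(z, 1), p)
```

The observed overlap corresponds to the red arrow in each panel and the list of random overlaps to the gray distribution.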


[Figure S2 graphic omitted. Density plots for dispensable, neutral, and indispensable nodes. Panel A: signaling proteins, membrane receptors, protein kinases, and transcription factors. Panel B: high, moderate, low, and very low copy number proteins. Panel C: all PTMs, acetylation, phosphorylation S/T, phosphorylation Y, and ubiquitination.]

Fig. S2. (A) Enrichment analysis of signaling proteins. Numbers of nodes overlapping with signaling proteins (annotated with signaling pathways in Cell Signaling Technology database www.cellsignal.com/common/content/content.jsp?id=science-pathways) (53), receptors (54), protein kinases (55, 56) (kinase.com/kinbase/index.html), and transcription factors (57) are shown in red arrows, and their respective size-controlled random set distributions in gray bars. (B) Enrichment analysis of protein abundance. Numbers of nodes overlapping with high copy numbers (>100,000 copies), moderate copy numbers (5,000–100,000 copies), low copy numbers (500–5,000 copies), and very low copy numbers (<500 copies) are shown in red arrows, and their respective size-controlled random set distributions in gray bars. The copy number dataset was obtained from Beck et al. (58). (C) Enrichment analysis of protein PTMs. Numbers of nodes overlapping with any PTM [Acetylation, Tyrosine Phosphorylation (Phosphorylation Y), Serine/Threonine Phosphorylation (Phosphorylation S/T), or Ubiquitination], Acetylation, Tyrosine Phosphorylation, Serine/Threonine Phosphorylation, and Ubiquitination datasets are shown in red arrows, and their respective size-controlled random set distributions in gray bars. The PTM dataset was obtained from Woodsmith et al. (59).


[Figure S3 graphic omitted. Panels A, C, and E: density plots for dispensable, neutral, and indispensable nodes covering cancer genes I, cancer genes II, OMIM disease genes, and GWAS (A); HIV targets (RNAi screens), HIV targets (PPIs), and common virus targets (C); and FDA targets, druggable genome I, and druggable genome II (E). Panels B, D, and F: z scores for each node type with no control (None), PubMed control, and degree control for Cancer II, OMIM, and GWAS (B); HIV targets (RNAi), HIV targets (PPIs), and virus targets (208 viruses) (D); and FDA approved and druggable targets (F). Panel G: percentage of critical, intermittent, and redundant nodes within each node type.]

Fig. S3. (A) Enrichment analysis of disease genes. Numbers of nodes overlapping with genes causally associated with cancer (Cancer genes I, Cancer Gene Census; cancer.sanger.ac.uk/census) (17), predicted cancer genes (Cancer genes II, extended list of cancer genes) (18), genes annotated as disease genes in the OMIM database (omim.org), and genes associated with disease in GWAS (www.genome.gov/gwastudies) are shown in red arrows, and their respective size-controlled random set distributions in gray bars. (B) Enrichment analysis of disease genes using literature- and degree-controlled random sets. In the case of degree- or literature-controlled random sets, the random sets are sampled such that the average degree or average PubMed records of random sets matches the average of node type N. (C) Enrichment analysis of virus targets. Numbers of nodes overlapping with genes identified to have an adverse effect on HIV-1 replication when knocked down (RNAi screens) (21–24), human proteins that directly interact with HIV proteins (HIV targets PPI) (26, 27), and human proteins that are known to physically interact with proteins from 208 viruses (common virus targets) (26–29) are shown in red arrows, and their respective size-controlled random set distributions in gray bars. (D) Enrichment analysis of virus targets using literature- and degree-controlled random sets. Random sets are generated as explained in B. (E) Enrichment analysis of drug targets. Numbers of nodes overlapping with proteins that are targeted by FDA-approved drugs (31), proteins with domains or folds that could bind to drug-like molecules (druggable genome I) (32), and a subset of druggable genome I excluding the FDA-approved drug targets (druggable genome II) are shown in red arrows, and their respective size-controlled random set distributions in gray bars. (F) Enrichment analysis of drug targets using literature- and degree-controlled random sets. Random sets are generated as explained in B. (G) Characterizing indispensable, neutral, and dispensable nodes based on their roles as driver nodes. A recently developed approach is used to classify a node as critical, intermittent, or redundant if it acts as a driver node in all, some, or none of the control configurations, respectively (33). The bar graph compares the indispensable, neutral, and dispensable nodes against the critical, intermittent, and redundant node classification.
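One way to build degree-controlled random sets, as described for panel B, is to sample within degree bins so that each random set has the same bin composition as the node set under test. The caption only states that the averages are matched, so the binning scheme below, the function names, and the toy degrees are assumptions; the sketch matches the average approximately rather than exactly. PubMed record counts could be substituted for degree to obtain literature-controlled sets.

```python
import random
from collections import Counter, defaultdict


def degree_bin(d):
    """Hypothetical logarithmic bins: 0, 1, 2-3, 4-7, 8-15, ..."""
    return d.bit_length()


def controlled_random_set(target, degree, rng):
    """Draw, for every target node, one random node from the same degree bin."""
    pools = defaultdict(list)
    for node in sorted(degree):
        pools[degree_bin(degree[node])].append(node)
    quota = Counter(degree_bin(degree[node]) for node in target)
    sample = []
    for b, k in sorted(quota.items()):
        sample.extend(rng.sample(pools[b], k))  # without replacement per bin
    return sample


# Toy network: 50 nodes with degrees 0..9, and a 4-node set of interest.
degree = {f"n{i}": i % 10 for i in range(50)}
target = ["n1", "n2", "n8", "n9"]
random_set = controlled_random_set(target, degree, random.Random(1))
print(random_set)
```

Repeating the draw many times and recording the overlap of each random set with an annotation list yields the controlled background distribution from which a z score can be computed.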


[Figure S4 graphic omitted. Panel A: matrix of RTK pathway members against the columns Cancer, OMIM, Virus, and FDA drug. Row labels recoverable from the extraction: ABL1, AKT1, CDC42, CSK, DVL3, EEF1A1, EGFR, EPHB2, ERBB2, ERBB3, FGFR1, GSK3B, HRAS, IGF1, IGF1R, IGF2, INS, INSR, IRS1, JAK1, JUN, KRAS, MAP2K1, MAP2K4, MAPK1, MAPK14, MAPK3, MAPK7, MAPK8, MAPK9, MET, MYC, NCOA2, NTRK1, PAK1, PDGFRB, PDPK1, RAC1, RAC2, RAF1, RB1, RET, SHC1, SMAD2, SMAD3, SMAD4, SOS1, TEK, TGFBR1, TGFBR2, and VAV1. Panel B: network linking proteins (ABL1, AKT1, BCL2, CCND1, CDK6, CREBBP, ERBB2, HSP90AA1, JAK2, JUN, LCK, MDM2, MLLT4, NCOA2, NFKB2, PDGFRB, PIK3R1, RARA, RB1, TOP1, and TP53) to viruses, cancer types, and FDA-approved drugs labeled by DrugBank ID; legend entries are Virus, Cancer mutation, FDA approved drug, and Proteins.]

Fig. S4. (A) Members of receptor tyrosine kinase (RTK) signaling pathways that are predicted as indispensable nodes and targeted by cancer mutations, OMIM disease, viruses, or FDA-approved drugs. RTK pathway members are as defined by the SignaLink database (60). (B) Indispensable nodes that are targeted by all three inputs (cancer mutation, viruses, and drugs). The labels of FDA drug nodes correspond to DrugBank IDs. The network was generated using Cytoscape (61).


[Figure S5 graphic omitted. Panels A, C, and E: fraction of indispensable, neutral, and dispensable nodes vs. edge score cutoff (A), fraction of edges rewired (C), and fraction of edges flipped (E). Panels B, D, and F: fraction of nodes with unchanged annotation vs. the same three x axes. Panel G: average betweenness centrality, closeness centrality, clustering coefficient, and neighborhood connectivity for type-I and type-II indispensable nodes.]

Fig. S5. (A and B) Robustness of node classification. (A) The fraction of indispensable, neutral, and dispensable nodes is plotted as a function of edge filtering (filtering using edge score). (B) The fraction of nodes in the filtered network sharing the same node classification as the real network (unchanged annotation) is plotted as a function of edge filtering. (C–F) Analysis of node classification in perturbed networks. (C) The fraction of indispensable, neutral, and dispensable nodes is plotted as a function of the fraction of edges rewired. (D) The fraction of nodes in the rewired network sharing the same node classification as the real network (unchanged annotation) is plotted as a function of the fraction of edges rewired. (E) Same as C, but the x axis corresponds to the fraction of flipped-edge directions. (F) Same as D, but the x axis corresponds to the fraction of flipped-edge directions. (G) Comparison of network properties of type-I and type-II indispensable nodes. The network betweenness centrality, closeness centrality, clustering coefficient, and neighborhood connectivity values are calculated using the NetworkAnalyzer Cytoscape plugin (62). The gray dotted line shows the average value of the network.
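The direction-flipping perturbation and the "unchanged annotation" measure in panels E and F can be sketched in Python as follows. The sampling scheme (a uniformly chosen fraction of edges, each reversed once), the function names, and the toy data are assumptions; the caption does not specify how edges were selected, and the rewiring procedure of panels C and D is not reproduced here.

```python
import random


def flip_edges(edges, fraction, rng):
    """Reverse the direction of a randomly chosen fraction of the edges."""
    edges = list(edges)
    chosen = set(rng.sample(range(len(edges)), round(fraction * len(edges))))
    return [(v, u) if i in chosen else (u, v) for i, (u, v) in enumerate(edges)]


def unchanged_fraction(reference, perturbed):
    """Fraction of nodes whose class is identical in both classifications."""
    same = sum(1 for node, cls in reference.items() if perturbed.get(node) == cls)
    return same / len(reference)


# Toy directed cycle and two hypothetical node classifications.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
flipped = flip_edges(edges, 0.5, random.Random(0))
real = {"a": "neutral", "b": "indispensable", "c": "dispensable", "d": "neutral"}
after = {"a": "neutral", "b": "neutral", "c": "dispensable", "d": "neutral"}
print(flipped, unchanged_fraction(real, after))
```

In a full analysis the perturbed network would be reclassified at each fraction, and the two quantities plotted against the fraction of flipped edges.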

[Figure S6 graphic omitted. Three density plots (panels A–C) titled Cancer Gene Census genes, STOP genes, and GO genes. The annotations P value = 2.5e−05 with Z score = 4.2, P value = 0.00037 with Z score = 3.5, and P value = 0.079 with Z score = −1 are printed on the panels.]

Fig. S6. Enrichment analysis of type-II indispensable nodes frequently amplified/deleted in cancer. Numbers of type-II indispensable nodes (frequently amplified/deleted in cancer) overlapping with genes causally associated with cancer (Cancer Gene Census) (17) (A), negative regulators of cell proliferation (STOP genes) (37) (B), and positive regulators of cell proliferation (GO genes) (37) (C) are shown in red arrows, and their respective size-controlled random set distributions in gray bars.


Dataset S1. Node classification and datasets used in this study

Dataset S1

Node classification: Node classification based on network controllability. The table includes node classifications (indispensable, neutral, or dispensable), their role as a driver node (critical, intermittent, or redundant), in-degree, out-degree, and all other node properties analyzed in this study. Datasets: datasets used in this study for enrichment analysis. Name, source, reference, number of genes/proteins compiled from the dataset, and overlap with the human-directed PPI network are listed.

Dataset S2. Enrichment analysis results and dataset overlaps

Dataset S2

Enrichment: results of enrichment analysis corresponding to Figs. 1E and 2 A–D. The table includes the number of overlapping genes, random mean, random SD, z score, and P value for indispensable, neutral, and dispensable nodes. Enrichment-controlled: results of degree- and literature-controlled enrichment analysis, corresponding to Figs. 1E and 2 A–D. The table is similar to Enrichment but uses degree- and literature-controlled random sets. Dataset Overlap: indispensable nodes in RTK signaling pathways and the nodes’ overlap with cancer mutations, OMIM disease, virus targets, and drug targets (corresponding to Fig. S4A). Indispensable Targets: indispensable nodes targeted by all three inputs of cancer mutations, virus targets, and drug targets (corresponds to Fig. S4B).

Dataset S3. Results from the analysis of edge-filtered network and subtype of indispensable nodes

Dataset S3

Edge filtering: results from the analysis of edge-filtered network. Edge rewiring: results from the analysis of rewired and direction-flipped networks. Subtypes: list of indispensable nodes with subtype classification (type-I and type-II). The table includes Entrez gene ID; gene symbol; subtype annotation; results from edge-filtered, rewired, and direction-flipped networks; and overlap with the regulators of cell proliferation.

Dataset S4. Perturbed indispensable nodes in cancer genomic data

Dataset S4

TCGA dataset: the gene expression and CNAs from cancer genomic studies. Gene expression data (level 3 expression calls for genes, per sample) are from TCGA, and copy number alterations are from cBioPortal. The overlapping samples refer to patient samples with both level 3 gene expression and CNA data available. Type-II nodes: most frequently amplified or deleted (top 1%) type-II indispensable nodes in nine different cancer types (corresponds to Fig. 4E). Enrichment: enrichment analysis of Cancer Gene Census, type-I and type-II indispensable nodes in human cancer.
