
BROWN UNIVERSITY

DEPARTMENT OF COMPUTER SCIENCE & CENTER FOR COMPUTATIONAL MOLECULAR BIOLOGY

Methods for Identifying Driver Pathways in Cancer

Author: Mark DM LEISERSON

Research Advisor: Prof. Benjamin RAPHAEL

Committee: Prof. Rodrigo FONSECA

Prof. Sohini RAMACHANDRAN

Prof. Eli UPFAL

Wednesday, April 24, 2013

Contents

Author Summary

Publications

Presentations

Simultaneous Identification of Multiple Driver Pathways in Cancer
    Author Summary
    Introduction
    Results
        Multi-Dendrix algorithm
        Simulated Data
        The Multi-Dendrix Computational Pipeline
        Somatic Mutation Data
        Mutually exclusive sets in Glioblastoma
        Mutually exclusive sets in Breast Cancer
    Discussion
    Methods
        Weight function for gene sets
        An ILP for the Maximum Weight Submatrix Problem
        Multiple Maximum Weight Submatrices Problem
        Simulations
        Construction of mutation matrices
        Evaluating known interactions
    Acknowledgments
    References
    Figures
    Tables
    Supporting Information
        Supporting Text
        References
        Supporting Tables
        Supporting Figures


Author Summary

My research during my first two years at Brown has focused on the problem of identifying driver pathways in cancer. Specifically, I have worked on developing algorithms to discover driver mutations from next-generation sequencing data. For my research comps defense, I present a paper – co-authored with Prof. Raphael and with researchers at Tel-Aviv University – titled “Simultaneous Identification of Multiple Driver Pathways in Cancer”, which has been accepted for publication by the journal PLoS Computational Biology. For this paper, I both developed a new algorithm for identifying driver pathways and applied this approach to data from large cancer sequencing studies.

In addition to my research developing algorithms for the analysis of mutation data, I have also applied my computational expertise via collaborations with biology researchers. Working with Prof. Raphael, Fabio Vandin, and Hsin-Ta Wu, I performed analysis of exclusive mutations in a cohort of acute myeloid leukemia (AML) patients as part of The Cancer Genome Atlas (TCGA) Research Network’s AML project. The TCGA’s paper on this project was recently accepted for publication by the New England Journal of Medicine, and I created a summary figure of my analysis that appears in the main text. I include the relevant reference in Publications below. Also, in collaboration with Prof. Raphael and researchers at Washington University in St. Louis, I have performed network analysis of somatic and germline mutations in ovarian cancer patients. This work is in preparation for submission to Nature Genetics, and my analysis is featured in the results.

Publications

1. Mark DM Leiserson, Dima Blokh, Roded Sharan, and Benjamin J Raphael. Simultaneous Identification of Multiple Driver Pathways in Cancer. PLoS Comp Bio (in press).

2. The Cancer Genome Atlas Research Network. The genomic and epigenomic landscape of adult de novo Acute Myeloid Leukemia (AML). New England Journal of Medicine (in press).

Presentations

1. Mark DM Leiserson, Hsin-Ta Wu, Dima Blokh, Fabio Vandin, Roded Sharan, and Benjamin J Raphael. Methods for Identifying Driver Pathways in Cancer, Poster Presentation, Beyond the Genome, Boston, MA: September 2012.

2. Mark DM Leiserson, Hsin-Ta Wu, Fabio Vandin, and Benjamin J Raphael. Network Analysis of Mutations Across Cancer Types, Poster Presentation, The Research in Computational Molecular Biology Conference (RECOMB), Beijing, China: April 2013.

3. Mark DM Leiserson, Hsin-Ta Wu, Fabio Vandin, and Benjamin J Raphael. Network and Pathway Analysis of Mutations Across Cancer Types, Poster Presentation, The Biology of Genomes Conference, Cold Spring Harbor, NY: May 2013.


Simultaneous Identification of Multiple Driver Pathways in Cancer ∗

Mark D.M. Leiserson¹, Dima Blokh², Roded Sharan²,†, and Benjamin J. Raphael¹,†

¹ Department of Computer Science and Center for Computational Molecular Biology, Brown University, Providence, RI, USA

² Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel

† Equal contribution

Abstract

Distinguishing the somatic mutations responsible for cancer (driver mutations) from random, passenger mutations is a key challenge in cancer genomics. Driver mutations generally target cellular signaling and regulatory pathways consisting of multiple genes. This heterogeneity complicates the identification of driver mutations by their recurrence across samples, as different combinations of mutations in driver pathways are observed in different samples.

We introduce the Multi-Dendrix algorithm for the simultaneous identification of multiple driver pathways de novo in somatic mutation data from a cohort of cancer samples. The algorithm relies on two combinatorial properties of mutations in a driver pathway: high coverage and mutual exclusivity. We derive an integer linear program that finds sets of mutations exhibiting these properties. We apply Multi-Dendrix to somatic mutations from glioblastoma, breast cancer, and lung cancer samples. Multi-Dendrix identifies sets of mutations in genes that overlap with known pathways – including Rb, p53, PI(3)K, and cell cycle pathways – and also novel sets of mutually exclusive mutations, including mutations in several transcription factors or other genes involved in transcriptional regulation. These sets are discovered directly from mutation data with no prior knowledge of pathways or gene interactions. We show that Multi-Dendrix outperforms other algorithms for identifying combinations of mutations and is also orders of magnitude faster on genome-scale data.

Software available at: http://compbio.cs.brown.edu/software.

Author Summary

Cancer is a disease driven largely by the accumulation of somatic mutations during the lifetime of an individual. The declining costs of genome sequencing now permit the measurement of somatic mutations in hundreds of cancer genomes. A key challenge is to distinguish driver mutations responsible for cancer from random passenger mutations. This challenge is compounded by the observation that different combinations of driver mutations are observed in different patients with the same cancer type.

One reason for this heterogeneity is that driver mutations target signaling and regulatory pathways which have multiple points of failure. We introduce an algorithm, Multi-Dendrix, to find these pathways solely from patterns of mutual exclusivity between mutations across a cohort of patients. Unlike earlier approaches, we simultaneously find multiple pathways, an essential feature for analyzing cancer genomes where multiple pathways are typically perturbed. We apply our algorithm to mutation data from hundreds of glioblastoma, breast cancer, and lung adenocarcinoma patients. We identify sets of interacting genes that overlap known pathways, and gene sets containing subtype-specific mutations. These results show that multiple cancer pathways can be identified directly from patterns in mutation data, and provide an approach to analyze the ever-growing cancer mutation datasets.

∗ The work in this section has been accepted for publication by the journal PLoS Comp Bio.

Introduction

Cancer is a disease driven in part by somatic mutations that accumulate during the lifetime of an individual. The declining costs of genome sequencing now permit the measurement of these somatic mutations in large numbers of cancer genomes. Projects such as The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) are now undertaking this task in hundreds of samples from dozens of cancer types. A key challenge in interpreting these data is to distinguish the functional driver mutations important for cancer development from random passenger mutations that have no consequence for cancer. The ultimate determinant of whether a mutation is a driver or a passenger is to test its biological function. However, because the ability to detect somatic mutations currently far exceeds the ability to experimentally validate their function, computational approaches that predict driver mutations are an urgent priority. One approach is to directly predict the functional impact of somatic mutations using additional biological knowledge from evolutionary conservation, protein structure, etc., and a number of methods implementing this approach have been introduced (see [1–4]). These methods are successful in predicting the impact of some mutations, but generally do not integrate information across different types of mutations (single nucleotide, indels, larger copy number aberrations, etc.); moreover, these methods are less successful for less conserved/studied proteins.

Given the declining costs of DNA sequencing, a standard approach to distinguish driver from passenger mutations is to identify recurrent mutations, whose observed frequency in a large cohort of cancer patients is much higher than expected [5, 6]. Nearly all cancer genome sequencing papers, including those from TCGA [7–10] and other projects [5, 11, 12], report a list of significantly mutated genes. However, driver mutations vary greatly between cancer patients – even those with the same (sub)type of cancer – and this heterogeneity significantly reduces the statistical power to detect driver mutations by tests of recurrence. One of the main biological explanations for this mutational heterogeneity is that driver mutations target not only individual genomic loci (e.g. nucleotides or genes), but also groups of genes in cellular signaling and regulatory pathways. Consequently, different cancer patients may harbor mutations in different members of a pathway important for cancer development. Thus, in addition to testing individual loci, or genes, for recurrent mutation in a cohort of patients, researchers also test whether groups of genes are recurrently mutated. Since exhaustive testing of all groups of genes is not possible without prohibitively large sample sizes (due to the necessary multiple hypothesis testing correction), current approaches focus on groups of genes defined by prior biological knowledge, such as known pathways (e.g. from KEGG [13]) or functional groups (e.g. from GO [14]), and methods have been introduced to look for enrichment in such pre-defined groups of genes (e.g. [15–17]). More recently, methods that identify recurrently mutated subnetworks in protein-protein interaction networks have also been developed, such as NetBox [18], MeMO [19], HotNet [20], and EnrichNet [21].

Knowledge of gene and protein interactions in humans remains incomplete, and most existing pathway databases and interaction networks do not precisely represent the pathways and interactions that occur in a particular cancer cell. Thus, restricting attention to only those combinations of mutations recorded in these data sources may limit the possibility for novel biological discoveries. Algorithms that do not make this restriction – but also avoid the multiple hypothesis testing problems associated with exhaustive enumeration – are therefore desirable. Recently, the RME [22] and De novo Driver Exclusivity (Dendrix) [23] algorithms were introduced to discover driver pathways using combinatorial constraints derived from biological knowledge of how driver mutations appear in pathways [24, 25]. In particular, each cancer patient contains a relatively small number of driver mutations, and these mutations perturb multiple cellular pathways. Thus, each driver pathway will contain approximately one driver mutation per patient. This leads to a pattern of mutual exclusivity between mutations in different genes in the pathway. In addition, an important driver pathway should be mutated in many patients, or have high coverage by mutations. Thus, driver pathways correspond to sets of genes that are mutated in many patients, but whose mutations are mutually exclusive, or approximately so. We emphasize that the driver pathways exhibiting patterns of mutual exclusivity and high coverage are generally smaller and more focused than most pathways annotated in the literature and pathway databases. The latter typically contain many genes and perform multiple different functions; e.g. the “cell cycle” pathway in KEGG contains 143 genes. It is well known that co-occurring (i.e., not exclusive) mutations are observed in these larger, multifunctional biological pathways [25]. The RME and Dendrix algorithms use different approaches to find sets of genes with high coverage and mutual exclusivity: RME builds sets of genes from pairwise scores of exclusivity, while Dendrix computes a single score for the mutual exclusivity of a set of genes, and finds the highest scoring set. The aforementioned MeMO algorithm [19] also considers mutual exclusivity between mutations, but only for pairs of genes that have recorded interactions in a protein-protein interaction network. Thus, MeMO does not attempt to identify driver pathways de novo and can only define subnetworks in existing interaction networks. While many of the strongest signals of mutual exclusivity are between genes with known interactions, below we show examples in cancer data of mutually exclusive mutations between genes with no known direct interactions.

The two existing de novo algorithms, RME and Dendrix, consider the detection of only a single driver pathway from the pattern of mutual exclusivity between mutations. However, it is well known that mutations in several pathways are generally required for cancer [26]. There is little reason to assume that mutations in different pathways will be mutually exclusive; on the contrary, they may exhibit significant patterns of co-occurrence across patients. Multiple pathways may be discovered using these algorithms by running the algorithm iteratively, removing the genes found in each previous iteration, and such an approach was employed for Dendrix [23]. However, such an iterative approach is not guaranteed to yield the optimal set of pathways.

Here we extend the Dendrix algorithm in three ways. First, we formulate the problem of finding exclusive, or approximately exclusive, sets of genes with high coverage as an integer linear program (ILP). This formulation allows us to find optimal driver pathways of various sizes directly – in contrast to the greedy approximation and Markov Chain Monte Carlo algorithms employed in Dendrix. Second, we generalize the ILP to simultaneously find multiple driver pathways. Third, we augment the core algorithm with additional analyses including: examining gene sets for subtype-specific mutations, summarizing stability of results across different numbers and sizes of pathways, and imposing greater exclusivity of gene sets.

We apply the new algorithm, called Multi-Dendrix, to four somatic mutation datasets: whole-exome and copy number array data in 261 glioblastoma (GBM) patients from The Cancer Genome Atlas (TCGA) [7], whole-exome and copy number array data in 507 breast cancer (BRCA) patients from TCGA [8], 601 sequenced genes in 84 patients with glioblastoma multiforme (GBM) from TCGA [7], and 623 sequenced genes in 188 patients with lung adenocarcinoma [27]. In each dataset Multi-Dendrix finds biologically interesting groups of genes that are highly exclusive, and where each group is mutated in many patients. In all datasets these include groups of genes that are members of known pathways critical to cancer development including: Rb, p53, and RTK/RAS/PI(3)K signaling pathways in GBM and p53 and PI(3)K/AKT signaling in breast cancer. Multi-Dendrix successfully recovers these pathways solely from the pattern of mutual exclusivity and without any prior information about the interactions between these genes. Moreover, Multi-Dendrix also identifies mutations that are mutually exclusive with these well-known pathways, and potentially represent novel interactions or crosstalk between pathways. Notable examples include mutual exclusivity between: mutations in PI(3)K signaling pathway and amplification of PRDM2 (and PDPN) in glioblastoma; mutations in p53, GATA3 and cadherin genes in breast cancer.

Finally, we compare Multi-Dendrix to an alternative approach of iteratively applying Dendrix [23] or RME [22], two other algorithms that search for mutually exclusive sets. We show that these iterative approaches typically fail to find an optimal set of pathways on simulated data, while Multi-Dendrix finds the correct pathways even in the presence of a large number of false positive mutations. On real cancer sequencing data, the groups of genes found by Multi-Dendrix include more genes with known biological interactions. Moreover, Multi-Dendrix is orders of magnitude faster than these other algorithms, allowing Multi-Dendrix to scale to the latest whole-exome datasets on hundreds of samples, which are largely beyond the capabilities of Dendrix and RME. Multi-Dendrix is a novel and practical approach to finding multiple groups of mutually exclusive mutations, and complements other approaches that predict combinations of driver mutations using biological knowledge of pathways, interaction networks, protein structure, or protein sequence conservation.

Results

Multi-Dendrix algorithm

The Multi-Dendrix algorithm takes somatic mutation data from $m$ cancer patients as input, and identifies multiple sets of mutations, where each set satisfies two properties: (1) the set has high coverage with many patients having a mutation in the set; (2) the set exhibits a pattern of mutual exclusivity where most patients have exactly one mutation in the set. We briefly describe the Multi-Dendrix algorithm here. Further details are provided in the Methods section below.

We assume that somatic mutations have been measured in $m$ cancer patients and that these mutations are divided into $n$ different mutation classes. A mutation class is a grouping of different mutation types at a specific genomic locus. In the simplest case, a mutation class corresponds to a grouping of all types of mutations (single nucleotide variants, copy number aberrations, etc.) in a single gene. We represent the somatic mutation data as an $m \times n$ binary mutation matrix $A$, where the entry $A_{ij}$ is defined as follows:

$$
A_{ij} =
\begin{cases}
1 & \text{if gene } j \text{ is mutated in patient } i \\
0 & \text{otherwise.}
\end{cases}
\tag{1}
$$

More generally, a mutation class may be defined for an arbitrary genomic locus, and not just a gene, and may distinguish different types of mutations. For example, one may define a mutation class as single-nucleotide mutations in an individual residue in a protein sequence or in a protein domain. Alternatively, one may separate different types of mutations in a gene (e.g. single-nucleotide mutations, deletions, or amplifications) by creating separate mutation classes for each mutation type in each gene. We will use this latter definition of mutation classes in the results below. For ease of exposition we will assume for the remainder of this section that each mutation class is a gene.
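As a concrete illustration of Equation (1), the following minimal sketch builds a binary mutation matrix from a list of observed mutations. The function name `build_mutation_matrix`, the record format, and the toy cohort are assumptions made for illustration only; they are not taken from the Multi-Dendrix software.

```python
def build_mutation_matrix(records, patients, genes):
    """Build the m x n binary matrix A of Equation (1).

    records  -- iterable of (patient, gene) pairs, one per observed mutation
    patients -- list of m patient identifiers (row order)
    genes    -- list of n gene identifiers (column order)
    """
    row = {p: i for i, p in enumerate(patients)}
    col = {g: j for j, g in enumerate(genes)}
    A = [[0] * len(genes) for _ in patients]
    for patient, gene in records:
        A[row[patient]][col[gene]] = 1  # repeated records collapse to a single 1
    return A


# Hypothetical toy cohort: three patients, three genes.
patients = ["p1", "p2", "p3"]
genes = ["TP53", "EGFR", "NF1"]
records = [("p1", "TP53"), ("p2", "EGFR"), ("p3", "TP53"), ("p3", "NF1")]
A = build_mutation_matrix(records, patients, genes)
for row_values in A:
    print(row_values)
```

To represent separate mutation classes per mutation type, the column identifiers could be (gene, type) pairs instead of gene names; the construction is otherwise unchanged.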

Vandin et al. [23] formulate the problem of finding a set of genes with high coverage and high exclusivity as the Maximum Weight Submatrix Problem. Here the weight $W(M) = |\Gamma(M)| - \omega(M)$ of a set $M$ of genes is the difference between the coverage $|\Gamma(M)|$, the number of patients with a mutation in at least one of the genes in $M$, and the coverage overlap $\omega(M)$, the number of patients having a mutation in more than one gene in $M$. Vandin et al. introduce the De novo Driver Exclusivity (Dendrix) algorithm [23], which finds a set $M$ of $k$ genes with maximum weight $W(M)$.
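The weight can be computed directly from a mutation matrix. The sketch below follows the verbal description in this section, taking the coverage overlap to be the number of patients with more than one mutated gene in the set; the formal definition in Methods takes precedence if the two differ. The function names and the toy matrix are illustrative assumptions.

```python
def coverage(A, M):
    """Gamma(M): indices of patients with a mutation in at least one gene of M."""
    return {i for i, row in enumerate(A) if any(row[j] for j in M)}


def coverage_overlap(A, M):
    """omega(M) as described in the text: patients mutated in more than one gene of M."""
    return sum(1 for row in A if sum(row[j] for j in M) > 1)


def weight(A, M):
    """W(M) = |Gamma(M)| - omega(M)."""
    return len(coverage(A, M)) - coverage_overlap(A, M)


# Hypothetical toy matrix: 5 patients (rows) x 3 genes (columns).
A = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],  # this patient has mutations in genes 0 and 1
    [0, 0, 1],
    [0, 0, 0],
]
print(weight(A, [0, 1]))  # coverage 3, overlap 1
print(weight(A, [0, 2]))  # coverage 3, overlap 0
```

In this toy example the pair of genes 0 and 2 scores higher than the pair 0 and 1, because both pairs cover three patients but only the first pair is perfectly exclusive.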

While finding single driver pathways is important, most cancer patients are expected to have driver mutations in multiple pathways. Dendrix used a greedy iterative approach to find multiple gene sets (described below), which is not guaranteed to find optimal gene sets. Identification of multiple driver pathways requires a criterion to evaluate possible collections of gene sets. Appealing to the same biological motivation as above, we expect that each pathway contains approximately one driver mutation per patient. Moreover, since each driver pathway is important for cancer development, we also expect that most individuals contain a driver mutation in most driver pathways. Thus, we expect high exclusivity within the genes of each pathway and high coverage of each pathway on its own. One criterion that satisfies these requirements is to find a collection $\mathbf{M} = \{M_1, M_2, \ldots, M_t\}$ of gene sets whose sum of weights is maximized.

We define the Multiple Maximum Weight Submatrices problem as the problem of finding such a maximum weight collection. We solve the Multiple Maximum Weight Submatrices problem using an integer linear program (ILP), and refer to the resulting algorithm as Multi-Dendrix (see Methods). In addition, the ILP formulation used in Multi-Dendrix uses a modified weight function $W_\alpha(M) = |\Gamma(M)| - \alpha\,\omega(M)$, where $\alpha > 0$ is a parameter that adjusts the tradeoff between finding sets with higher coverage $|\Gamma(M)|$ (more patients with a mutation) versus higher coverage overlap $\omega(M)$ (greater non-exclusivity between mutations). We use this parameter in the breast cancer dataset below. In contrast, Dendrix was limited to $\alpha = 1$.
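The objective that the ILP maximizes can be evaluated for any candidate collection with a few lines of code. The sketch below only scores a given collection; it is not the ILP itself, which is described in Methods. The function names and the toy matrix are illustrative assumptions, and the coverage overlap is again taken as the number of patients with more than one mutated gene in the set.

```python
def weight_alpha(A, M, alpha=1.0):
    """W_alpha(M) = |Gamma(M)| - alpha * omega(M) for gene column indices M."""
    covered = 0
    overlapping = 0
    for row in A:
        hits = sum(row[j] for j in M)
        covered += hits >= 1      # patient counts toward coverage
        overlapping += hits > 1   # patient counts toward coverage overlap
    return covered - alpha * overlapping


def collection_weight(A, collection, alpha=1.0):
    """Sum of W_alpha over the gene sets M_1, ..., M_t in the collection."""
    return sum(weight_alpha(A, M, alpha) for M in collection)


# Hypothetical toy matrix: 4 patients x 4 genes.
A = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
collection = [[0, 1], [2, 3]]
print(collection_weight(A, collection))             # alpha = 1
print(collection_weight(A, collection, alpha=2.0))  # heavier penalty on overlap
```

Raising $\alpha$ above 1 penalizes each overlapping patient more strongly, so the same collection receives a lower score whenever any of its sets is not perfectly exclusive.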

Simulated data

We compare Multi-Dendrix to iterative versions of Dendrix [23] and RME [22] on simulated mutation data containing both driver mutations implanted in pathways in a mutually exclusive manner and random passenger mutations. The goal of these simulations is to compare Multi-Dendrix to other algorithms that identify mutually exclusive genes on straightforward datasets that contain multiple mutually exclusive sets. We generate mutation data for $m = 160$ patients and $n = 360$ genes as follows. We select a set of four pathways $\mathbf{P} = (P_1, P_2, P_3, P_4)$ with each $P_i$ containing four genes. We select the coverage $|\Gamma(P_i)|$ uniformly from the following intervals: $[0.75m, 0.9m]$, $[0.6m, 0.75m]$, $[0.45m, 0.6m]$, $[0.3m, 0.45m]$, respectively. The size of this dataset and the varying coverages of the pathways model what is observed in real data (see § Somatic Mutation data) and are consistent with models of mutation progression where driver mutations accumulate in pathways [28]. For each pathway $P_i$, we select $|\Gamma(P_i)|$ patients at random and add a driver mutation to exactly one gene from the set $P_i$. Thus, the driver mutations in each pathway are mutually exclusive. We then add passenger mutations by randomly mutating genes in each patient with probability $q$, the passenger mutation probability. We used values of $q$ similar to our estimates for $q$ on the TCGA GBM and Lung cancer data sets (in § Somatic Mutation data below), which were $q = 0.001$ and $q = 0.0005$, respectively. We emphasize that these simulations do not model all of the complexities of somatic mutations in cancer, e.g. gene-specific and patient-specific mutation rates, genes present in multiple pathways, etc.
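The simulation procedure above can be sketched as follows. This is a minimal reading of the description and not the code used for the reported experiments. It assumes that the planted pathways occupy the first 16 gene columns, that the mutated gene for each driver mutation is chosen uniformly within the pathway, and that passenger mutations are applied independently to every gene, including pathway genes; the function name `simulate` is hypothetical.

```python
import random


def simulate(m=160, n=360, q=0.001, seed=0):
    """Return (A, pathways): an m x n binary matrix with four planted pathways."""
    rng = random.Random(seed)
    A = [[0] * n for _ in range(m)]
    # Assumption: pathway P_i uses gene columns 4i .. 4i+3.
    pathways = [list(range(4 * i, 4 * i + 4)) for i in range(4)]
    intervals = [(0.75, 0.9), (0.6, 0.75), (0.45, 0.6), (0.3, 0.45)]
    for genes, (lo, hi) in zip(pathways, intervals):
        # Coverage |Gamma(P_i)| drawn uniformly from [lo * m, hi * m].
        cov = rng.randint(round(lo * m), round(hi * m))
        for i in rng.sample(range(m), cov):
            A[i][rng.choice(genes)] = 1  # exactly one driver gene per chosen patient
    # Passenger mutations: each (patient, gene) entry mutated with probability q.
    for i in range(m):
        for j in range(n):
            if rng.random() < q:
                A[i][j] = 1
    return A, pathways


A, pathways = simulate(q=0.001, seed=42)
for genes in pathways:
    covered = sum(1 for row in A if any(row[j] for j in genes))
    print(genes, covered)
```

With $q = 0$ the planted sets are perfectly exclusive; with $q > 0$ passenger mutations can introduce a small amount of overlap inside a pathway, which is the noise that the compared algorithms must tolerate.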

Since the Dendrix and RME algorithms are designed to find single pathways, we compared Multi-Dendrix to iterative versions of these methods that return multiple gene sets. For Dendrix we used the iterative approach described in [23]: apply Dendrix to find a highest scoring gene set, remove those genes from the dataset, and apply Dendrix to the reduced dataset, repeating these steps until a desired number $t$ of gene sets are found. We will refer to this algorithm as Iter-Dendrix. Thus, Iter-Dendrix returns a collection $\mathbf{P} = (P_1, P_2, \ldots, P_t)$ of $t$ gene sets such that $W(P_1) \ge W(P_2) \ge \cdots \ge W(P_t)$. We implemented the analogous iterative version of RME, and will refer to this algorithm as Iter-RME. We compared the collection $\mathbf{M}$ of gene sets found by each algorithm to the planted pathways $\mathbf{P}$, computing the symmetric difference $d(\mathbf{P}, \mathbf{M})$ between $\mathbf{M}$ and $\mathbf{P}$ as described in Methods.
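The iterative scheme, and the reason it can miss the best collection, can be seen on a small example. In the sketch below an exhaustive search over all gene sets of size $k$ stands in for Dendrix (which uses a greedy approximation or MCMC); the function names and the toy matrix are assumptions for illustration, and the coverage overlap is taken as the number of patients with more than one mutated gene in the set.

```python
from itertools import combinations


def weight(A, M):
    """W(M): patients covered by M minus patients mutated in more than one gene of M."""
    hits = [sum(row[j] for j in M) for row in A]
    return sum(h >= 1 for h in hits) - sum(h > 1 for h in hits)


def best_set(A, genes, k):
    """Exhaustive stand-in for a single-pathway search over the given genes."""
    return max(combinations(genes, k), key=lambda M: weight(A, M))


def iterative_sets(A, k, t):
    """Find t gene sets of size k, removing each found set before the next search."""
    remaining = list(range(len(A[0])))
    found = []
    for _ in range(t):
        M = best_set(A, remaining, k)
        found.append(M)
        remaining = [g for g in remaining if g not in M]
    return found


# Hypothetical toy matrix: 9 patients x 4 genes.
A = [
    [1, 0, 0, 1], [1, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 0],
    [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0],
    [0, 0, 1, 1],
]
greedy = iterative_sets(A, k=2, t=2)
print(greedy, sum(weight(A, M) for M in greedy))
print(weight(A, (0, 2)) + weight(A, (1, 3)))
```

On this matrix the iterative approach first takes the single best pair, genes 0 and 1, which forces the poorly exclusive pair 2 and 3 in the second round. The alternative collection of pairs (0, 2) and (1, 3) has a larger total weight, which is the kind of solution a simultaneous formulation can reach and the iterative one cannot.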

Table 1 shows a comparison of Multi-Dendrix, Iter-Dendrix, and Iter-RME on simulated mutation data for different values of $q$. Note that we do not show comparisons to Iter-RME for $q \ge 0.005$, as Iter-RME did not complete within 24 hours of runtime for any of the 1000 simulated mutation data sets. While the RME publication [22] analyzed mutation matrices with thousands of genes and hundreds of patients, this analysis (and the released RME software) required that mutations be present in at least 10% of the samples, greatly reducing the number of genes/samples input to the algorithm. In fact, a threshold of 10% will remove nearly all genes in current whole-exome studies (see § Comparison of Multi-Dendrix and RME).

For $0.0005 \le q \le 0.015$, Multi-Dendrix identifies collections of gene sets that were significantly closer ($p < 0.01$) to the planted pathways $\mathbf{P}$ than the collections found by either Iter-Dendrix or Iter-RME. These results demonstrate that Multi-Dendrix outperforms other methods, even when the passenger mutation probability $q$ is more than 15 times greater than the value estimated from real somatic mutation data. For $q \le 0.0001$, the differences between Multi-Dendrix and Iter-RME were not significant.

We also compared the runtimes of each algorithm on the simulated datasets. Multi-Dendrix was several orders of magnitude faster than Iter-Dendrix and Iter-RME on all datasets (Table 2). Note that as the passenger mutation probability $q$ increases, the number of recurrently mutated passenger genes increases. Multi-Dendrix scales much better than Iter-RME and maintains a significant advantage over Iter-Dendrix, completing all simulated datasets in less than 5 seconds.

We evaluated how the runtime of Multi-Dendrix scales to larger datasets. Using the same passenger mutation probabilities $0.0001 \le q \le 0.02$ listed above, we calculated the average runtime in seconds of Multi-Dendrix for ten simulated mutation matrices with $m = 100, 200, 400, 800, 1600, 3200, 6400, 12800, 22000$ genes and $n = 1000$ patients, more than the number of patients to be measured in any cancer study from TCGA (note that in this scaling experiment $m$ denotes the number of genes and $n$ the number of patients). In each case, we run Multi-Dendrix only on the subset of genes that are mutated in more than the expected number $nq$ of samples. For the largest dataset with $m = 22000$ genes, the average number of genes input to Multi-Dendrix for the highest and lowest passenger mutation probabilities are $\sim 9700$ and $\sim 2100$, respectively. (Table S1 shows the average number of input genes for varying $m$ and $q$.) The average runtime for this largest dataset is under one hour (average of 54.4 minutes). Figure S1 shows the runtimes for varying $m$ and $q$.
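The gene filter used in this scaling experiment, keeping only genes mutated in more than the expected number $nq$ of samples, can be sketched as below. The function name and toy matrix are illustrative assumptions; rows are patients and columns are genes.

```python
def recurrent_gene_indices(A, q):
    """Return column indices of genes mutated in more than n * q patients,
    where n is the number of patients (rows of A)."""
    n_patients = len(A)
    expected = n_patients * q  # expected passenger mutation count per gene
    n_genes = len(A[0])
    return [j for j in range(n_genes) if sum(row[j] for row in A) > expected]


# Hypothetical toy matrix: 4 patients x 3 genes, column counts 2, 1, 0.
A = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
]
print(recurrent_gene_indices(A, q=0.3))  # threshold 1.2 mutated patients
print(recurrent_gene_indices(A, q=0.1))  # threshold 0.4 mutated patients
```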

The Multi-Dendrix Computational Pipeline

We incorporate the Multi-Dendrix algorithm into a larger pipeline (Figure 1) that includes several additional pre- and post-processing tasks including: (1) Building mutation matrices for input into Multi-Dendrix; (2) Summarizing Multi-Dendrix results over multiple values for the parameters $t$, the number of gene sets, $k_{\min}$, the minimum size of a gene set, and $k_{\max}$, the maximum size of a gene set; (3) Evaluating the statistical significance of results; (4) Examining Multi-Dendrix results for mutually exclusive sets resulting from subtype-specific mutations. We describe these steps briefly below, with further details in the Methods and Supporting Information.

First, we build mutation matrices $A$ from somatic mutation data. We use several steps to process single-nucleotide variant (SNV) data and copy number variant (CNV) data, and to combine both types of data. Second, in contrast to simulated data, on real data we do not know the correct values of the parameters $t$, $k_{\min}$, and $k_{\max}$. Thus, we consider a reasonable range of values for these parameters and summarize the results over these parameters into modules. We build a graph, where the nodes are individual genes (or mutation classes) and edges connect genes (respectively mutation classes) that appear in the same gene set for more than one value of the parameters. We weight each edge with the fraction of parameter values for which the pair of genes appear in the same gene set. The resulting edge-weighted graphs provide a measure of the stability of the resulting gene sets over different parameter values. By choosing a minimum edge weight, we partition the graph into connected components, or modules. One may choose to use these modules as the output of Multi-Dendrix.
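The summarization step described above can be sketched as follows. The input format (one collection of gene sets per parameter setting), the function name `stability_modules`, and the single-letter gene labels are assumptions for illustration; genes without any retained edge are simply omitted from the output in this sketch.

```python
from collections import Counter
from itertools import combinations


def stability_modules(results, min_weight):
    """results: list with one collection of gene sets per parameter setting.

    Returns (weights, modules): edge weights for gene pairs that co-occur in
    more than one setting, and the connected components of the graph that
    keeps only edges with weight >= min_weight.
    """
    counts = Counter()
    for collection in results:
        pairs = set()
        for gene_set in collection:
            pairs.update(combinations(sorted(gene_set), 2))
        counts.update(pairs)  # each pair counted at most once per setting
    weights = {pair: c / len(results) for pair, c in counts.items() if c > 1}

    # Adjacency lists restricted to sufficiently stable edges.
    adj = {}
    for (u, v), w in weights.items():
        if w >= min_weight:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    # Connected components by depth-first search.
    modules, seen = [], set()
    for start in sorted(adj):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adj[node] - component)
        seen |= component
        modules.append(sorted(component))
    return weights, modules


# Hypothetical results for four parameter settings.
results = [
    [{"A", "B", "C"}, {"D", "E"}],
    [{"A", "B"}, {"C", "D", "E"}],
    [{"A", "B"}, {"D", "E"}],
    [{"A", "C"}, {"D", "F"}],
]
weights, modules = stability_modules(results, min_weight=0.6)
print(weights)
print(modules)
```

Lowering `min_weight` to 0.5 in this toy example retains the A–C edge and merges C into the first module, which shows how the threshold controls how conservative the reported modules are.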

Third, we evaluate the statistical significance of our results using two measures. Since a collection $\mathbf{M}$ with high weight $W'(\mathbf{M})$ may not be surprising in a large mutation matrix $A$, the first measure evaluates the significance of the score $W'(\mathbf{M})$ maximized by Multi-Dendrix. We evaluate whether the weight $W'(\mathbf{M}^*)$ of the maximum weight collection $\mathbf{M}^*$ output by Multi-Dendrix is significantly large compared to an empirical distribution of the maximum weight sets from randomly permuted mutation data. We generate random mutation data using the permutation test described in [19]. This test permutes the mutations among the genes in each patient, preserving both the number of mutated genes in each patient and the number of patients with a mutation in each gene while perturbing any patterns of exclusivity between mutated genes. Note that this permutation test requires running Multi-Dendrix many times to determine statistical significance for a single parameter setting. Thus, the runtime advantages of Multi-Dendrix compared to Iter-Dendrix and Iter-RME are very important in practice on real datasets.
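One way to generate permuted matrices with the stated invariants is swap randomization, sketched below. This is used here as a stand-in and may differ in detail from the procedure in [19]. The `score` argument is a hypothetical callback; in the pipeline it would be the maximum collection weight returned by Multi-Dendrix, whereas the usage example substitutes a trivial score to keep the sketch self-contained.

```python
import random


def permute_matrix(A, n_swaps, rng):
    """Return a copy of A randomized by swaps that preserve all row and column sums."""
    B = [row[:] for row in A]
    ones = [(i, j) for i, row in enumerate(B) for j, v in enumerate(row) if v]
    for _ in range(n_swaps):
        a, b = rng.sample(range(len(ones)), 2)
        (i, j), (k, l) = ones[a], ones[b]
        # Swap (i, j), (k, l) -> (i, l), (k, j) only if both targets are empty.
        if i != k and j != l and not B[i][l] and not B[k][j]:
            B[i][j] = B[k][l] = 0
            B[i][l] = B[k][j] = 1
            ones[a], ones[b] = (i, l), (k, j)
    return B


def empirical_pvalue(A, score, n_perm=100, seed=0):
    """Fraction of permuted matrices scoring at least as high as the observed one."""
    rng = random.Random(seed)
    observed = score(A)
    n_ones = sum(map(sum, A))
    hits = sum(
        score(permute_matrix(A, 10 * n_ones, rng)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy matrix and a stand-in score (patients with exactly one
# mutation among genes 0 and 1).
A = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
p = empirical_pvalue(A, lambda X: sum(1 for r in X if r[0] + r[1] == 1), n_perm=20)
print(p)
```

Because every permutation requires a fresh optimization in the real pipeline, the cost of the test is roughly the number of permutations times the runtime of a single Multi-Dendrix run.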


Next, we evaluate whether the collection $\mathbf{M}^*$ output by Multi-Dendrix contains more protein-protein interactions than expected by chance by applying our direct interactions test on a PPI network constructed from the union of the KEGG and iRefIndex PPI networks. The direct interactions test computes a statistic $\nu$ of the difference in the number of interactions within and between gene sets in $\mathbf{M}^*$, and compares the observed value of $\nu$ to an empirical distribution on 1000 permuted PPI networks (full details of the test are in § Evaluating known interactions). These permuted networks account for the observation that many genes that are frequently mutated in cancer also have large degree in the interaction network – either due to biological reasons or ascertainment bias. We use an interaction network to assess biological function rather than known pathways (e.g. KEGG pathways or GSEA sets) because most of these pathways are relatively large, while the gene sets found by Multi-Dendrix that exhibit exclusivity tend to be much smaller, each containing only a few genes.

Finally, we examine possible correlations between the mutually exclusive sets reported by Multi-Dendrix and particular subsets of samples. A number of cancers are divided into subtypes according to pathology, cytogenetics, gene expression, or other features. Since mutations that are specific to particular subtypes will be mutually exclusive, disease heterogeneity is an alternative explanation to pathways for observed mutually exclusive sets. For example, [29] report four subtypes of GBM based on gene expression clusters, and show that several mutations (including IDH1, PDGFRA, EGFR, and NF1) have strong association with individual subtypes. Unfortunately, if the subtypes are unknown, Multi-Dendrix, Dendrix, RME, and other algorithms that analyze mutual exclusivity have no information with which to distinguish between mutual exclusivity resulting from subtypes and mutual exclusivity resulting from pathways or other causes. If subtypes are known, two possible solutions are to analyze subtypes separately, or to examine whether patterns of mutual exclusivity are associated with these subtypes. We annotate results by known subtypes as a post-processing step in Multi-Dendrix.

Somatic Mutation Data

We applied Multi-Dendrix and Iter-Dendrix to four somatic mutation matrices: (1) copy number variants (CNVs), small indels, and non-synonymous single nucleotide variants (SNVs) measured in 601 genes in 84 glioblastoma multiforme (GBM) patients [7]; (2) indels and non-synonymous single nucleotide variants in 623 sequenced genes in 188 Lung Adenocarcinoma patients [27]; (3) CNVs, small indels, and non-synonymous SNVs measured using whole-exome sequencing and copy number arrays in 261 GBM patients [7]; and (4) CNVs, small indels, and non-synonymous SNVs measured in 507 BRCA patients. We will refer to these datasets as GBM(2008), Lung, GBM, and BRCA below. We removed extremely low frequency mutations and known outliers from these datasets as described in Methods. After this processing, the GBM(2008) dataset contained mutation and CNV data for 46 genes in 84 patients; the Lung dataset contained somatic mutations for 190 genes in 163 patients; the GBM dataset contained mutation and CNV data for 398 genes in 261 patients; and the BRCA dataset contained mutation and CNV data for 375 genes in 507 patients. We focus here on presenting results from the latter two datasets because they are the latest whole-genome/exome datasets and the most representative of the datasets that are now being produced and will be analyzed in the coming years. Results with the first two older and smaller datasets from targeted sequencing are described in the Supporting Information.

We compute $t$ gene sets for $2 \le t \le 4$, each of minimum size $k_{\min} = 3$ and maximum size $k_{\max}$ with $3 \le k_{\max} \le 5$. We summarize the results over these 9 parameter settings (three values of $t$ times three values of $k_{\max}$) into modules using the procedure described above.


Mutually exclusive sets in Glioblastoma (GBM)

We applied Multi-Dendrix and Iter-Dendrix to the GBM dataset, considering EGFR amplification as a separate event (see Methods). The algorithms report the same results over all values of the parameters except $k_{\max} = 5$, where Iter-Dendrix includes the IRF5 gene in a gene set with RB1, CDK4(A), CDKN2A/CDKN2B(D), and MSL3. However, Multi-Dendrix is much faster, running in 142 seconds compared to 37,786 seconds (over 10 hours) for Iter-Dendrix.

We summarize the results of these different parameter choices by connecting genes that appear in the same gene set at least twice, resulting in four modules (Figure 2). These four modules include all the genes (except ERBB2) that are: (1) members of the three signaling pathways highlighted in the TCGA GBM study [7], and (2) mutated in > 5% of the samples. The weight W′(M) of every collection found by Multi-Dendrix on the GBM dataset is significant (P < 0.0001), and the direct interactions statistic ν of these four modules is also significant (P = 0.002). Three of the four modules also contain a significant number of interactions (P < 0.05). In addition to these four modules, two additional mutation classes, CNTNAP2 and deletion of 10q26.3, each appear in one choice of parameters for Multi-Dendrix. Since these are not part of a larger module, they were not further analyzed. Figure S2 shows a combined mutation matrix with all four modules.
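The summarization step can be sketched as a small graph computation: count how often each pair of genes shares a gene set across parameter settings, keep pairs seen at least twice, and report the connected components. This is a hedged illustration that assumes modules correspond to connected components of that graph; `build_modules`, `min_count`, and the gene labels A to G are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_modules(runs, min_count=2):
    """Group genes into modules from the gene sets of several runs.

    runs: list of collections, one per parameter setting; each collection
    is a list of gene sets. Two genes are linked when they share a gene set
    in at least `min_count` runs; modules are the connected components.
    """
    pair_counts = defaultdict(int)
    for collection in runs:
        for gene_set in collection:
            for g, h in combinations(sorted(gene_set), 2):
                pair_counts[(g, h)] += 1

    adjacency = defaultdict(set)
    for (g, h), count in pair_counts.items():
        if count >= min_count:
            adjacency[g].add(h)
            adjacency[h].add(g)

    modules, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        modules.append(sorted(component))
    return modules

# Hypothetical output of two parameter settings; gene G appears only once.
runs = [
    [{"A", "B", "C"}, {"D", "E", "F"}],
    [{"A", "B", "C", "G"}, {"D", "E", "F"}],
]
modules = build_modules(runs)
```

In this toy input, gene G co-occurs with A, B, and C only once, so it is left out of every module, mirroring how mutation classes that appear for a single parameter choice are set aside.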

The first module includes the amplification of CDK4, mutation of RB1, and a deletion that includes both CDKN2A and CDKN2B. This module is mutated in 87.7% (229/261) of the samples, and is discovered for all parameter choices. These four genes are members of the RB signaling pathway (as annotated in [7]) involved in G1/S progression ($P = 9.7 \times 10^{-14}$ by Bonferroni-corrected hypergeometric test): CDKN2A and CDKN2B inhibit CDK4, which in turn inhibits RB1. In addition, for 7/9 parameter choices, this module includes mutations in MSL3. MSL3 is a member of the MSL (male-specific lethal) complex that has a major role in dosage compensation in Drosophila. While this complex is conserved in mammals, the specific function of human MSL3 is unknown. However, the MSL complex also includes the histone acetyltransferase MOF, which is involved in cell cycle regulation of p53 and may play a role in cancer [30]. Thus, the mutual exclusivity of mutations in MSL3 and the other well-known members of the RB signaling pathway is intriguing and deserves further study. This module contains two interactions (P = 0.005).

The second module includes mutations and deletion of PTEN, mutations in PIK3CA, mutations in PIK3R1, mutations in IDH1, and an amplification that includes PDPN and PRDM2. The module is mutated in 62.8% (164/261) of the samples. PTEN, PIK3CA, and PIK3R1 are all members of the RTK/RAS/PI(3)K signaling pathway (as annotated in [7]) involved in cellular proliferation ($P = 3.2 \times 10^{-11}$). IDH1 is not a known member of this pathway; moreover, IDH1 is preferentially mutated in the proneural subtype of GBM [29]. Deletions in PTEN are also associated with the proneural subtype of GBM, although they are not considered a defining feature of this subtype (as IDH1 mutations are) and do not result in a gene expression signature [29]. However, there are no reports that PTEN, PIK3CA, or PIK3R1 mutations are subtype specific, and thus the mutual exclusivity of IDH1 and the remaining genes in this set is not simply explained by subtypes. PRDM2 is not known to be part of the RTK/RAS/PI(3)K signaling pathway. PRDM2 is a member of the histone methyltransferase superfamily, interacts with the RB protein [31], and is proposed as a tumor suppressor in colorectal cancer [32]. PDPN is used as a molecular marker for glioma, due to its association with clinical outcomes [33]. Our results suggest that PDPN and PRDM2 may have an undiscovered role in GBM as well. This module contains three interactions (P = 0.001).

The third module includes mutations in TP53, the amplification of MDM2, the amplification of MDM4, mutations in NLRP3, and the deletion involving AKAP6 and NPAS3. This module is mutated in 57.8% (151/261) of the samples, and appears for every parameter choice for t > 2. TP53, MDM2, and MDM4 are members of the p53 signaling pathway ($P = 9.7 \times 10^{-14}$), a critical and frequently altered pathway in GBM involved in senescence and apoptosis. NPAS3 is a transcription factor expressed in the brain and implicated in psychiatric disorders including schizophrenia [34, 35]. In addition, NPAS3 was recently shown to act as a tumor suppressor in astrocytomas, with a possible role in glioblastoma progression and proliferation [36]. This module contains three interactions (P = 0.001).

The fourth module includes mutations in EGFR, the amplification of PDGFRA, and the deletion of RB1. This module is mutated in 45.6% (119/261) of samples, and appears for t = 4. EGFR and PDGFRA are members of the RTK/RAS/PI(3)K signaling pathway ($P = 8.9 \times 10^{-9}$), and RB1 is a member of the RB signaling pathway. While EGFR and PDGFRA both interact with RAS, there are no reported direct interactions between these three proteins. In addition, mutations in these three genes are significantly associated with two of the expression subtypes reported in [29]: mutations in EGFR and the deletion of RB1 are associated with the Classical GBM subtype, and the amplification of PDGFRA is significantly associated with the Proneural subtype. Thus, it appears that the mutual exclusivity discovered by Multi-Dendrix is a result of subtype-specific mutations, despite PDGFRA and EGFR being members of the same biological pathway.

In summary, we see that subtype-specific mutations provide an alternative explanation for observed mutual exclusivity and confound the identification of driver pathways. However, on the GBM data subtype-specific mutations are a minor feature in the data, and Multi-Dendrix successfully identifies de novo portions of three critical signaling pathways in GBM.

Mutually exclusive sets in Breast Cancer (BRCA)

We applied Multi-Dendrix and Iter-Dendrix to the BRCA dataset. We found that for most values of the parameters $t$ and $k_{\max}$, the results combined the most frequently mutated genes into a single gene set despite the fact that these genes had high coverage overlap (Figures S4 and S5). That is, for a gene set M, high coverage |Γ(M)| was outweighing a high coverage overlap ω(M) in the weight function W(M) = |Γ(M)| − ω(M) optimized by Multi-Dendrix. To enforce greater mutual exclusivity, we increased the coverage overlap penalty to α = 2.5 from its default value of α = 1.
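One natural reading of the penalty α, stated here as an assumption because the generalized formula is not written out in this section, is as a multiplier on the coverage overlap. The subscripted symbol $W_\alpha$ and the numbers in the worked example are hypothetical.

```latex
W_\alpha(M) = |\Gamma(M)| - \alpha \, \omega(M),
\qquad
\omega(M) = \sum_{g \in M} |\Gamma(g)| - |\Gamma(M)| .

% Worked example with |\Gamma(M)| = 300 and \omega(M) = 40:
%   \alpha = 1   gives  W_1(M)     = 300 - 40         = 260
%   \alpha = 2.5 gives  W_{2.5}(M) = 300 - 2.5 \cdot 40 = 200
```

Under this reading, α = 1 recovers W(M) = |Γ(M)| − ω(M), and larger values of α make a gene set with many co-occurring mutations less attractive relative to a smaller, more exclusive one.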

Using α = 2.5, Multi-Dendrix identifies four distinct modules (Figure 3). These four modules overlap with three pathways known to be important in BRCA: p53 signaling, PI(3)K/AKT signaling, and cell cycle checkpoints. In addition, they include genes recently identified by [8] as important BRCA genes. These modules cover a smaller proportion of samples than our results on the GBM dataset even though they include the same number of genes, suggesting greater mutational heterogeneity or disease heterogeneity (i.e. subtypes) in the breast cancer dataset. Indeed, breast cancers are commonly divided into four major subtypes: Luminal A, Luminal B, Basal, and HER2 type. We annotate the subtype-specific mutations below, and Table S6 lists the significant associations between mutations and subtypes. The weight W′(M) of every collection found by Multi-Dendrix on the BRCA dataset is significant (P < 0.01), and the direct interactions statistic ν of these four modules is also significant (P < 0.001). In addition, one module is significantly enriched for interactions (P < 0.05). Beyond these four modules, five additional genes (SF3B1, CCDC150, COL23A1, C20orf26, PCDHA5) each appear in one choice of parameters for Multi-Dendrix. Since these are not part of a larger module, they were not further analyzed. Figure S3 shows a combined mutation matrix with all four modules.

The first module includes the deletion of PTEN, mutations in PIK3CA, PIK3R1, AKT1, HIF3A, and the amplification of genomic region 12p13.33. These six mutation classes are mutated in 61% (308/507) of the samples, and this module is discovered for all values of the parameters. PTEN, PIK3CA, PIK3R1, and AKT1 form the core of the PI(3)K/AKT signaling pathway, as annotated in [8] ($P < 10^{-15}$): AKT1 is known to interact with PTEN, PIK3R1, and PIK3CA, while PTEN inhibits PIK3CA and PIK3R1. These genes constitute five of the eight genes mutated in the PI(3)K/AKT signaling pathway, as reported by [8]. One of the other mutation classes in this module is the amplification of 12p in 82 samples. This amplification event has been reported before in BRCA, although the likely target of this amplification is unknown [37]. This module contains six interactions (P < 0.001).

The second module includes mutations in the genes TP53, CDH1, GATA3, CTCF, and GPRIN2. These five genes are mutated in 56% (283/507) of the samples, and this module is discovered for each parameter choice. TP53 is a member of the p53 signaling pathway, while GATA3, CDH1, and CTCF are well-known for their role in breast cancer and are involved in metastasis and proliferation. The loss or downregulation of CDH1, the E-cadherin gene located on 16q22.1, is implicated in breast cancer invasion and proliferation (reviewed in [38]). CTCF, also located on 16q22.1, has been found to act as a tumor suppressor in breast cancer through mechanisms similar to CDH1 [39]. GATA3 is a transcription factor that regulates immune cells, and has long been known to be involved in breast cancer tumorigenesis [40]. Recently, a novel role for GATA3 was discovered, whereby GATA3 suppresses breast cancer metastasis through inhibition of E-cadherin promoters [41]. This module contains zero known interactions.

The third module includes the deletion of MAP2K4 and mutations in MAP3K1, PPEF1, SMARCA4, and WWP2. These five mutations occur in 44.4% (225/507) of the samples, and this module is identified when the number t of gene sets is at least 3. MAP3K1 and MAP2K4 are both members of the p38-JNK1 stress kinase pathway as reported in [8], and are involved in the regulation of apoptosis. Both MAP3K1 and MAP2K4 are serine/threonine kinases, while PPEF1 is a serine/threonine phosphatase. The targets of PPEF1 are unknown, although there are reports that PPEF2, another member of the gene family, does interact with the p38-JNK pathway through the gene ASK1 [42]. SMARCA4 is also a known cancer gene, and has been shown to have a role as a tumor suppressor in lung cancer [43]. This module contains one interaction (P = 0.179).

The fourth module includes the amplification of CCND1 and mutations in MAP2K4, RB1, and GRID1. These four mutations occur in 36.3% (184/507) of the samples, and this module is discovered when the number t of gene sets is 4. CCND1 and RB1 encode interacting proteins that play a role in cell cycle progression. CCND1 encodes the cyclin-D1 protein that inhibits the retinoblastoma protein encoded by RB1 via hyperphosphorylation. The hyperphosphorylation of RB inactivates its role as a tumor suppressor, and thus mutations that target either CCND1 or RB1 are important for proliferation in cancer [44]. Mutations in MAP2K4, as discussed above for the third module, target the p38-JNK1 pathway, and MAP2K4 is not known to have any interactions with CCND1 or RB1. This module contains one interaction (P = 0.1).

We separately ran Multi-Dendrix on mutation data restricted to the 224 Luminal A patients, the 124 Luminal B patients, the 93 Basal-like patients, and the 58 HER2-enriched patients (as annotated in [8]) using different values of α for the different sized datasets (see Supporting Information § BRCA subtypes for details). The gene pair {TP53, GATA3} from the Multi-Dendrix modules is also identified by Multi-Dendrix when restricting to basal-like or luminal B samples. The same is true for the gene pair {AKT1, PIK3CA} in HER2-enriched and luminal A subtypes. Thus, the mutual exclusivity between mutations in these pairs of genes is not a result of subtype-specific mutations. On the other hand, the mutual exclusivity between 12p13.33 amplification and PIK3CA mutation appears to be an effect of subtypes, as these aberrations are associated with basal-like and luminal A subtypes, respectively (Table S6), and these two aberrations are not grouped into the same module on any Multi-Dendrix run of individual subtypes. These results show that mutual exclusivity can result from mutational heterogeneity within pathways, disease heterogeneity with subtype-specific mutations, or both.

Discussion

We introduce an algorithm Multi-Dendrix that simultaneously finds multiple cancer driver pathways using somatic mutation data from a collection of patients. Multi-Dendrix finds combinations of somatic mutations solely from the combinatorial pattern of their mutations with no prior knowledge of pathways or interactions between genes. On simulated data, Multi-Dendrix outperforms iterative versions of Dendrix and RME, two previous algorithms for identifying single driver pathways. Multi-Dendrix finds optimal groups of mutually exclusive mutations even in the presence of significant noise where the passenger mutation rate is 15 times greater than observed in real data. Multi-Dendrix is orders of magnitude faster than iterative versions of Dendrix and RME and scales to the analysis of mutations in thousands of genes from hundreds of cancer patients. Finally, Multi-Dendrix finds optimal solutions over a range of sizes for the individual gene sets, while Dendrix examines only a fixed size gene set for each run.

We apply Multi-Dendrix to multiple cancer datasets, including glioblastoma and breast cancer data from The Cancer Genome Atlas, two of the most comprehensive somatic mutation datasets from high-throughput sequencing data. Multi-Dendrix finds multiple driver pathways of interacting genes/proteins including portions of the p53, RTK/RAS/PI(3)K, PI(3)K/AKT, and Rb signaling pathways. At the same time, we identify additional genes whose mutations are mutually exclusive with mutations in these pathways. Intriguingly, these additional genes are transcription factors or other nuclear proteins involved in regulation of transcription (SMARCA4, NPAS3, PRDM2, and MSL3). In general, gene transcription is the downstream “output” of signaling pathways, suggesting that mutations in these downstream targets may be substituting for mutations in the upstream (and presumably more general) signaling pathways.

Our results demonstrate the advantages of simultaneously finding sets of mutually exclusive genes. However, there can be multiple explanations for observed mutual exclusivity in a set of genes, including both pathways (i.e., interactions between genes) and disease heterogeneity (i.e., cancer subtypes). Interpretation of Multi-Dendrix results should consider these various explanations. To facilitate these analyses, we include additional steps in the Multi-Dendrix pipeline to compare the resulting sets of mutations to known gene interactions and/or known subtypes in the samples.

Our Multi-Dendrix analysis shows an interesting feature of mutual exclusivity between SNVs and deletions in tumor suppressor genes. In the glioblastoma data, we find that SNVs in PTEN and the deletion of PTEN are mutually exclusive. It is not clear whether this is a genuine biological feature of the mutational process, where either SNVs or deletions (but not both) are present in different samples, or merely an artifact of the mutation calling. Regarding the latter, it is possible that SNVs are more challenging to detect in hemizygous samples, where the other allele has been deleted.

There are several extensions that might further improve the Multi-Dendrix algorithm. First, the ILP used in Multi-Dendrix finds optimal solutions effectively, but is not guaranteed to rigorously examine suboptimal solutions. This is in contrast to the Markov Chain Monte Carlo (MCMC) approach used by Dendrix that samples suboptimal solutions in proportion to their weight. Extending the MCMC approach to simultaneous discovery of multiple pathways, or perhaps using the ILP to initialize a sampling procedure, are interesting directions for future research. Second, the weight function used by the Multi-Dendrix algorithm does not explicitly incorporate co-occurrence of mutations between genes in different gene sets. Instead, the weight function is highest for gene sets with high coverage and approximate exclusivity, which may co-occur in many patients simply due to high coverage (a trivial example is when all gene sets have full coverage, and thus all mutations co-occur across gene sets). As noted in [25], co-occurrence of mutations is known to be important in large biological pathways, and thus algorithms that explicitly optimize for collections of gene sets where mutations between gene sets co-occur a surprising amount may have an advantage in identifying key components of these larger biological pathways.

We anticipate that Multi-Dendrix will be useful for analyzing somatic mutation data from different types of cancer and with larger cohorts of patients, such as those now being generated by The Cancer Genome Atlas (TCGA) and other large-scale cancer sequencing projects. However, mutual exclusivity and high coverage are not the only criteria for selecting driver mutations, and thus the model optimized by Multi-Dendrix has some limitations. As noted above, it is well-known that each patient has multiple driver mutations and there are many examples of co-occurring mutations. Some of these co-occurring mutations are clearly in different pathways, although this depends on one’s definition of a pathway. Large, multi-functional “pathways” such as the cell-cycle pathway do indeed exhibit co-occurring mutations; on the other hand, co-occurring mutations appear to be less common in directly interacting genes/proteins. In addition, the coverage, or frequency, of important driver mutations may vary considerably, according to the stage at which patients are sequenced, or extensive disease heterogeneity. Private mutations that are unique to a single individual in a study might be driver mutations, but will not exhibit a strong signal of mutual exclusivity. Multiple approaches are required to prioritize somatic mutations for further experimental study. Multi-Dendrix is a useful complement for analysis of significantly mutated genes [6, 45, 46], functional impact [1–4], pathway/network analysis [18–22], and other approaches.

Methods

Weight function for gene sets

Given a mutation matrix $A$ and a set $M$ of genes (or equivalent mutation classes), Vandin et al. [23] define the weight function $W(M)$ as follows. For a gene $g$, the coverage $\Gamma(g) = \{i : A_{ig} = 1\}$ is the set of patients in which gene $g$ is mutated. Similarly, for a set $M$ of genes, the coverage is $\Gamma(M) = \cup_{g \in M} \Gamma(g)$. $M$ is mutually exclusive if $\Gamma(g) \cap \Gamma(g') = \emptyset$ for all $g, g' \in M$, $g \ne g'$. A gene set in $A$ is a column submatrix of $A$ with high coverage and approximate exclusivity. Since increasing coverage may be achieved at the expense of decreasing exclusivity, [23] define a weight $W(M)$ on a set $M$ of columns of $A$ that quantifies both coverage and exclusivity of $M$. In particular, they define the coverage overlap of $M$ as $\omega(M) = \sum_{g \in M} |\Gamma(g)| - |\Gamma(M)|$. Note that $\omega(M) = 0$ when $M$ is mutually exclusive. Then $W(M)$ is the difference between the coverage and coverage overlap of $M$:

$$W(M) = |\Gamma(M)| - \omega(M) = 2|\Gamma(M)| - \sum_{g \in M} |\Gamma(g)|. \tag{2}$$

Note that for a mutually exclusive submatrix $M$, $|\Gamma(M)| = \sum_{g \in M} |\Gamma(g)|$ and therefore $W(M) = |\Gamma(M)|$.

Vandin et al. [23] introduce the Maximum Weight Submatrix Problem, defined for an integer $k > 0$ as the problem of finding the $m \times k$ submatrix $M$ of $A$ that maximizes the weight $W(M)$, show that this problem is NP-hard, and derive the De novo Driver Exclusivity (Dendrix) algorithm to solve it. Dendrix is a Markov Chain Monte Carlo (MCMC) algorithm that samples sets of $k$ genes in proportion to their weight $W$. While the MCMC algorithm is not guaranteed to find a gene set of optimal $W$, the authors showed that applying the method to real somatic mutation data produces gene sets with high coverage and approximate exclusivity, and that the Markov chain rapidly converges to a stationary distribution.
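The definitions of coverage, coverage overlap, and weight can be checked on a small example. The sketch below is an illustration only: the function names `coverage` and `weight` and the toy matrix are hypothetical, with rows as patients and columns as genes.

```python
def coverage(matrix, genes):
    """Set of patient (row) indices mutated in at least one of the given gene columns."""
    return {i for i, row in enumerate(matrix) if any(row[g] for g in genes)}

def weight(matrix, genes):
    """W(M) = 2|Gamma(M)| - sum over g in M of |Gamma(g)|, as in equation (2)."""
    per_gene_total = sum(len(coverage(matrix, [g])) for g in genes)
    return 2 * len(coverage(matrix, genes)) - per_gene_total

# Hypothetical toy matrix: 5 patients (rows) x 3 genes (columns).
A = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]

exclusive_weight = weight(A, [0, 1])       # genes 0 and 1 never co-occur
overlapping_weight = weight(A, [0, 1, 2])  # gene 2 co-occurs with gene 1 in patient 3
```

Genes 0 and 1 are mutually exclusive and cover four patients, so their weight equals their coverage. Adding gene 2 covers no new patient but adds one overlapping mutation, which lowers the weight by one.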

An ILP for the Maximum Weight Submatrix Problem

We formulate the Maximum Weight Submatrix Problem as an integer linear program, which we refer to as $\text{Dendrix}_{\text{ILP}}(k)$. Given a mutation matrix $A$ and gene set size $k$, $\text{Dendrix}_{\text{ILP}}(k)$ finds a gene set $M^*$ with largest weight $W(M^*)$. A gene set $M$ is determined by a set of indicator variables, one for each gene $j$,

$$I_M(j) = \begin{cases} 1 & \text{if gene } j \text{ is a member of gene set } M, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

To compute the weight function $W(M)$ in (2), it is necessary to compute the coverage $\Gamma(M)$. To do this, we define an indicator variable for each patient $i$,

$$C_i(M) = \begin{cases} 1 & \text{if gene set } M \text{ is mutated in patient } i, \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$

Then, $\text{Dendrix}_{\text{ILP}}(k)$ is defined as follows:

$$\text{maximize} \quad \sum_{i=1}^{m} \left( 2 \cdot C_i(M) - \sum_{j=1}^{n} I_M(j) \cdot A_{ij} \right) \tag{5a}$$

subject to

$$\sum_{j=1}^{n} I_M(j) = k, \tag{5b}$$

$$\sum_{j=1}^{n} A_{ij} \cdot I_M(j) \ge C_i(M), \quad \text{for } 1 \le i \le m. \tag{5c}$$

Note that the last constraint (5c) only forces $C_i(M) = 0$ when all genes in $M$ are not mutated (i.e. $\sum_{j=1}^{n} A_{ij} \cdot I_M(j) = 0$), but does not force $C_i(M) = 1$ when at least one gene in $M$ is mutated as required by (4). However, in the latter case the objective function will be maximized when $C_i(M) = 1$ and thus (4) is satisfied.

Note that removing equation (5b) from $\text{Dendrix}_{\text{ILP}}(k)$ above produces a gene set with optimal $W$ for any value of $k$, i.e. a gene set $M$ with maximum $W(M)$ of any size. We will refer to this variant of the ILP as $\text{Dendrix}_{\text{ILP}}$. In addition, we can set bounds on the size of the gene set $M$, $k_{\min} \le |M| \le k_{\max}$, by replacing equation (5b) with

$$k_{\min} \le \sum_{j=1}^{n} I_M(j) \le k_{\max}. \tag{6}$$

We will refer to this variant as $\text{Dendrix}_{\text{ILP}}(k_{\min}, k_{\max})$.
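The relationship between objective (5a) and the weight W(M) can be checked by exhaustive search on a toy matrix. The sketch below is a brute-force stand-in, not an ILP solver and not the CPLEX formulation: it enumerates every gene set of size k and evaluates (5a) with each C_i set to its largest value allowed by (5c). The names `ilp_objective` and `brute_force_best_set` and the matrix are hypothetical.

```python
from itertools import combinations

def ilp_objective(matrix, genes):
    """Objective (5a) with each C_i at its largest feasible value under (5c)."""
    value = 0
    for row in matrix:
        mutated = sum(row[j] for j in genes)  # sum_j A_ij * I_M(j)
        c_i = 1 if mutated >= 1 else 0        # (5c) forces C_i = 0 when mutated == 0
        value += 2 * c_i - mutated
    return value

def brute_force_best_set(matrix, k):
    """Return (best objective, best gene set) over all gene sets of size exactly k."""
    n_genes = len(matrix[0])
    return max((ilp_objective(matrix, M), M) for M in combinations(range(n_genes), k))

# Hypothetical toy matrix: 5 patients (rows) x 4 genes (columns).
A = [
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
best_value, best_set = brute_force_best_set(A, k=2)
```

Exhaustive enumeration grows combinatorially with the number of genes, which is why an ILP solver is used in practice; the sketch is only meant to make the objective concrete.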

Multiple Maximum Weight Submatrices Problem

We define the following problem.

Multiple Maximum Weight Submatrices Problem. Given an $m \times n$ mutation matrix $A$ and an integer $t > 0$, find a collection $\mathbf{M} = \{M_1, M_2, \ldots, M_t\}$ of $m \times k$ column submatrices that maximizes $W'(\mathbf{M}) = \sum_{\rho=1}^{t} W(M_\rho)$.

Note that this problem is NP-hard, since the case $t = 1$ is the Maximum Weight Submatrix Problem, which is NP-hard as stated above. Further note that while the weight $W'(\mathbf{M})$ is increased by greater mutual exclusivity between mutations within a gene set, there is no restriction on the mutations between gene sets. Moreover, collections with large weight $W'(\mathbf{M})$ will also tend to have larger coverage $\Gamma(M_\rho)$ of each individual gene set $M_\rho$. Thus, optimal solutions will tend to produce a collection with the property that many patients have a mutation in more than one gene set; or alternatively, there may be pairs or larger sets of co-occurring mutations, a phenomenon that has been observed in cancer [25].

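We solve the Multiple Maximum Weight Submatrices Problem using an integer linear program (ILP). We define an ILP, called Multi-Dendrix, using the same definitions of $C_i(M)$ and $I_M(j)$ as above:

$$\text{maximize} \quad \sum_{\rho=1}^{t} \sum_{i=1}^{m} \left( 2 \cdot C_i(M_\rho) - \sum_{j=1}^{n} I_{M_\rho}(j) \cdot A_{ij} \right) \tag{7a}$$

subject to

$$\sum_{j=1}^{n} I_{M_\rho}(j) \cdot A_{ij} \ge C_i(M_\rho), \quad \text{for } 1 \le i \le m,\ 1 \le \rho \le t, \tag{7b}$$

$$\sum_{\rho=1}^{t} I_{M_\rho}(j) \le 1, \quad \text{for } 1 \le j \le n. \tag{7c}$$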

We solve this ILP using CPLEX v12.3 with default parameters. For a mutation matrix $A$, the collections $\mathbf{M}$ and $\mathbf{I}$ of gene sets produced by Multi-Dendrix and Iter-Dendrix, respectively, satisfy $W'(\mathbf{M}) \ge W'(\mathbf{I})$, since the collection produced by Iter-Dendrix is itself a feasible solution of the Multi-Dendrix ILP. Multi-Dendrix can also produce sets $\mathbf{M}$ with strictly larger weight than the sets $\mathbf{I}$ produced by Iter-Dendrix. There are a number of ways this might occur. First, there may be multiple gene sets $I_\rho$ with maximum weight on iteration $\rho$, and only one of these is extended by the iterative algorithm. Second, the maximum weight gene set $I_\rho$ selected by Iter-Dendrix in the $\rho$th iteration may not be a member of the optimal $\mathbf{M}$; i.e. $\mathbf{M}$ may contain gene sets that are suboptimal when considered in isolation. Third, when $k_{\min} < k_{\max}$ Multi-Dendrix may select gene sets with fewer than $k_{\max}$ genes if this maximizes the weight $W'(\mathbf{M})$ of the whole collection. We find that all of these cases occur on real somatic mutation data.
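The second case above, where the best single gene set is not part of the best collection, can be reproduced on a toy matrix. The sketch below uses exhaustive search in place of the ILP and a greedy loop in place of Iter-Dendrix; it is an illustration under those assumptions, and `greedy_collection`, `optimal_collection`, and the matrix are hypothetical.

```python
from itertools import combinations

def weight(matrix, genes):
    """W(M) = 2 * (patients covered) - (total mutations in the gene set)."""
    covered = sum(1 for row in matrix if any(row[j] for j in genes))
    total = sum(row[j] for row in matrix for j in genes)
    return 2 * covered - total

def collection_weight(matrix, collection):
    """W'(M): sum of the weights of the gene sets in the collection."""
    return sum(weight(matrix, M) for M in collection)

def greedy_collection(matrix, t, k):
    """Iterative strategy: repeatedly take the best size-k set among unused genes."""
    remaining = set(range(len(matrix[0])))
    collection = []
    for _ in range(t):
        best = max(combinations(sorted(remaining), k), key=lambda M: weight(matrix, M))
        collection.append(best)
        remaining -= set(best)
    return collection

def optimal_collection(matrix, t, k):
    """Simultaneous strategy: best collection of t disjoint size-k gene sets."""
    candidates = list(combinations(range(len(matrix[0])), k))
    best, best_w = None, float("-inf")
    for collection in combinations(candidates, t):
        used = [j for M in collection for j in M]
        if len(used) != len(set(used)):
            continue  # constraint (7c): each gene in at most one gene set
        w = collection_weight(matrix, collection)
        if w > best_w:
            best, best_w = list(collection), w
    return best

# Hypothetical toy matrix: 8 patients x 4 genes. Genes 2 and 3 are mutated
# in the same two patients, so pairing them together is worthless.
A = [[1, 0, 0, 0]] * 3 + [[0, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 2

greedy_w = collection_weight(A, greedy_collection(A, t=2, k=2))
optimal_w = collection_weight(A, optimal_collection(A, t=2, k=2))
```

Here the greedy strategy first takes genes 0 and 1 (weight 6) and is then left with genes 2 and 3 (weight 0), while the simultaneous search pairs each of genes 0 and 1 with one of genes 2 and 3 for a total weight of 10.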

Multi-Dendrix can be extended to allow genes to be members of more than one gene set. Such genes may be involved in multiple biological processes. We define the parameter $\Delta$ to be the maximum number of gene sets a gene can be a member of, and the parameter $\tau$ to be the number of overlaps allowed per gene set. Then we replace Eq. (7c) with $\sum_{\rho=1}^{t} I_{M_\rho}(j) \le \Delta$ for all $j$, and add the constraint $\sum_{j=1}^{n} \sum_{\rho' \ne \rho} I_{M_\rho}(j) \cdot I_{M_{\rho'}}(j) \le \tau$ for $1 \le \rho \le t$. Note that the product in the second group of constraints must be encoded using additional indicator variables and thus the number of additional constraints grows rapidly as more overlaps between gene sets are allowed. While this extension is implemented in the Multi-Dendrix package, we did not use it for the results presented herein.

Simulations

We ran Dendrix with default parameters using $4 \times 10^4 n$ iterations for the MCMC method, where $n$ is the number of genes. We also used the default parameters for RME. We ran each algorithm with the true values $t = 4$ and $k_{\min} = k_{\max} = 4$. We compared the collection $\mathbf{M}$ of gene sets found by each algorithm to the planted pathways $\mathbf{P}$, computing the difference between $\mathbf{M}$ and $\mathbf{P}$ as follows:

$$d(\mathbf{P}, \mathbf{M}) = \min_{\pi} \sum_{\rho=1}^{|\mathbf{P}|} \left| P_{\pi(\rho)} \,\triangle\, M_\rho \right|,$$

where $\pi$ is a permutation of $(1, 2, 3, 4)$ and $\triangle$ is the symmetric difference between sets.
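The distance between planted and recovered collections can be computed directly by trying every matching of planted pathways to found gene sets. The sketch below is a minimal illustration; `collection_distance` and the toy gene labels are hypothetical, and enumerating all permutations is only practical for small t such as the t = 4 used here.

```python
from itertools import permutations

def collection_distance(planted, found):
    """Smallest total symmetric difference over all matchings of planted to found sets."""
    if len(planted) != len(found):
        raise ValueError("collections must contain the same number of gene sets")
    return min(
        sum(len(planted[pi[r]] ^ found[r]) for r in range(len(found)))
        for pi in permutations(range(len(planted)))
    )

# Hypothetical example: the found sets are listed in a different order,
# and one planted gene ("f") was replaced by a spurious gene ("x").
planted = [{"a", "b", "c"}, {"d", "e", "f"}]
found = [{"d", "e", "x"}, {"a", "b", "c"}]
distance = collection_distance(planted, found)
```

Minimizing over permutations makes the measure independent of the order in which an algorithm reports its gene sets, so only genuine differences in gene membership are counted.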

Construction of mutation matrices

We have two pipelines for building mutation matrices from somatic mutation data: one for the whole-exome datasets and the second for the targeted gene sequencing datasets, GBM(2008) and lung adenocarcinoma. Here we describe the pipeline for the whole-exome datasets. See Supporting Information for further details on the targeted gene sequencing data processing.


Whole-exome and copy number aberration data preparation. For the whole-exome datasets, we extracted non-silent somatic mutations from MAF files and copy number aberrations from GISTIC 2.0 [47] downloaded from the TCGA portal:

• GBM: http://gdac.broadinstitute.org/runs/analyses__2012_09_13/data/GBM/20120913/

– MAF file: gdac.broadinstitute.org_GBM.Mutation_Assessor.Level_4.2012091300.0.0/GBM.maf.annotated

– GISTIC wide peaks: gdac.broadinstitute.org_GBM.CopyNumber_Gistic2.Level_4.2012091300.0.0.tar.gz

• BRCA: https://tcga-data.nci.nih.gov/docs/publications/brca_2012/

– MAF file: “Somatic MAF archive [tar.gz] (public access)”

– Segment data: “Merged segment file”

– GISTIC wide peaks: “Genes in focal peaks [xslx]”

We applied the following filters to remove genes from the analysis:

• 10,443 genes in the GBM dataset and 11,428 genes in the BRCA dataset mutated in fewer than 5 samples.

• 209 genes in the GBM dataset and 474 genes in the BRCA dataset whose coding regions are longer than 6Kb and had fewer mutations than expected using a binomial test with an estimate of $10^{-6}$ for the background mutation rate.

• 94 genes from both datasets that are observed to have unusually high rates of somatic mutation in exome-sequencing data, but are likely artifacts resulting from replication timing, active transcription, and other factors (as reported by M. Lawrence here: http://1.usa.gov/RBtuz7). These include olfactory receptors, “cub and sushi” proteins, and TTN.

• 10 additional genes (KRTAP5-5, HEATR7B2, EML2, CC2D1A, DUS3L, GABRA6, FLG, HYDIN, SUSD3, CROCCL1) that also appear to be artifacts as they were observed to have higher than average mutation rates, but with no known role in cancer.

For copy number data, we used the “wide peaks” in the GISTIC output as the copy number aberrations in each sample, with copy number ratio thresholds of 0.1 and -0.1 for determining if a segment is amplified or deleted, respectively. We performed the following additional filtering of the copy number data.

• We combined copy number aberrations that co-occur in the same sample > 50% of the time and are within 10Mb of one another into a single larger “meta” gene.

• We restricted analysis to 249 CNVs in the GBM data and 204 CNVs in the BRCA data that appeared in wide peaks with 20 or fewer genes or were given a peak label by GISTIC.

• For BRCA, we removed segments longer than 10Mb.

• We removed wide peaks containing known fragile sites at PARK2 [48], WWOX [49], RPL5, and FAM19A5.

Note that in many cases, the wide peaks include multiple genes, which we combine into a single “meta” gene for Multi-Dendrix analysis. We manually selected the target gene for the following metagenes:

• For GBM, we selected CDK4, CDKN2A and CDKN2B, PTEN, PDGFRA, MDM2, MDM4, PIK3R1, and RB1 as the targets of their respective aberrations.


• For BRCA, we used the GISTIC peak labels as the target for each aberration.

• In the GBM dataset, we treated EGFR as a separate event and did not include it in the Multi-Dendrix analysis. EGFR amplification is the most common amplification, and second most common mutation (after CDKN2A/B deletion). There is a significant co-occurrence between EGFR amplifications and SNVs (p = 9.4 × 10⁻⁵ by Fisher’s exact test), although EGFR amplification occurs in 74 more samples than EGFR SNV.

After this filtering, we analyze a total of 398 mutation classes in GBM and 375 mutation classes in BRCA (i.e. separating SNVs and CNVs in individual genes or metagenes).

Evaluating known interactions

We assess whether the mutually exclusive gene sets found by Multi-Dendrix are enriched for interacting genes using the following direct interactions test. We use two analogous tests: one for collections, and one for individual gene sets.

For a collection P = (P1, . . . , Pt), let α(P) be the fraction of possible interactions between pairs of genes within each gene set Pi that are observed and β(P) be the fraction of possible interactions between pairs of genes in different pairs of gene sets Pi, Pj that are observed. We define ν(P) = α(P) − β(P). Thus, a large value ν(P) indicates a collection with many interactions between genes in the same gene set and few interactions between genes in different gene sets. This measure models our assumption that mutual exclusivity should be strongest between genes that directly interact. Moreover, we also expect few interactions between genes in different gene sets reported by Multi-Dendrix. Thus, the measure ν is a more strict measure of the topology of interactions than merely counting the total number of observed interactions. Similarly, we assess an individual gene set by counting the number of interactions among genes in that gene set. Let ρ(P) be the number of interactions among genes in P.
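The statistics ν and ρ defined above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, gene sets are assumed to be disjoint, and α is computed by pooling the within-set pairs of all gene sets (the text could also be read as a per-set average). The edge list in the example is a toy network, not real interaction data.

```python
from itertools import combinations


def fraction_observed(pairs, edges):
    """Fraction of the given gene pairs that are edges in the network."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if frozenset((a, b)) in edges)
    return hits / len(pairs)


def nu(collection, edges):
    """nu = alpha - beta for a collection of (assumed disjoint) gene sets."""
    # Pairs of genes inside the same gene set (pooled over sets: an assumption).
    within = [p for s in collection for p in combinations(sorted(s), 2)]
    # Pairs of genes drawn from two different gene sets.
    between = [(a, b) for s1, s2 in combinations(collection, 2)
               for a in s1 for b in s2]
    return fraction_observed(within, edges) - fraction_observed(between, edges)


def rho(gene_set, edges):
    """Number of interactions among the genes of a single gene set."""
    return sum(1 for a, b in combinations(sorted(gene_set), 2)
               if frozenset((a, b)) in edges)


# Toy network and collection (hypothetical, for illustration only).
toy_edges = {frozenset(e) for e in
             [("G1", "G2"), ("G2", "G3"), ("G4", "G5"), ("G1", "G5")]}
toy_collection = [{"G1", "G2", "G3"}, {"G4", "G5", "G6"}]
toy_nu = nu(toy_collection, toy_edges)        # alpha = 3/6, beta = 1/9
toy_rho = rho(toy_collection[0], toy_edges)   # two edges inside the first set
```

Storing edges as `frozenset` pairs makes the lookup independent of the order in which the two genes are listed, which matches the undirected interactions considered here.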

We assess the statistical significance of ν and ρ by comparing the observed value to the empirical distribution of ν and ρ on an ensemble of permuted networks with the same degree distribution. We permute the network by swapping edges between pairs of genes, so as to preserve the distribution of edges within the network. We perform Q × |E| edge swaps, where E is the set of edges in the network and Q is a constant (we use Q = 100 as recommended for permuting graphs with fixed degree distribution in [50]). This network permutation corrects for the observation that many genes that are frequently mutated in cancer also have many interactions in human PPI networks. Since frequently mutated genes typically appear in Multi-Dendrix results, the number of interactions between genes in Multi-Dendrix results might be higher than for a random set of genes of the same size. When performing the direct interactions test, we use a protein-protein interaction (PPI) network constructed from the union of interactions reported in KEGG [13] and iRefIndex 9.0 [51], containing 236,060 interactions among 15,257 proteins. We calculate p-values from 1000 permuted versions of this combined network.
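The permutation procedure described above can be sketched as follows, using only the Python standard library. This is a minimal sketch under stated assumptions: the network is a simple undirected graph without self-loops, Q × |E| counts attempted swaps (rejected swaps are not retried), and the empirical p-value is the plain fraction of permuted values at least as large as the observed one. The function names are hypothetical.

```python
import random
from collections import Counter


def permute_network(edges, q=100, seed=None):
    """Degree-preserving permutation by repeated edge swaps.

    Picks two edges (a, b) and (c, d) and rewires them to (a, d) and (c, b),
    skipping swaps that would create a self-loop or a duplicate edge.
    """
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    edge_list = sorted(tuple(sorted(e)) for e in edge_set)
    if len(edge_list) < 2:
        return edge_set
    for _ in range(q * len(edge_list)):   # Q * |E| attempted swaps (assumption)
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:            # randomize the orientation of edge j
            c, d = d, c
        if len({a, b, c, d}) < 4:         # shared endpoint: would make a loop
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_set


def degree_sequence(edges):
    """Map each node to its degree."""
    degrees = Counter()
    for edge in edges:
        for node in edge:
            degrees[node] += 1
    return degrees


def empirical_p_value(observed, null_values):
    """Fraction of permuted statistics that are >= the observed statistic."""
    return sum(1 for v in null_values if v >= observed) / len(null_values)


# Toy network: an 8-node ring with two chords (hypothetical data).
toy_network = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (1, 5)]
permuted = permute_network(toy_network, q=100, seed=0)
```

Each accepted swap leaves the degree of all four endpoints unchanged, so the permuted network has the same degree sequence as the input; `degree_sequence` can be used to check this.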

By examining only direct interactions between genes in our sets, we might miss cases where genes do not directly interact, but have a common interacting partner (e.g. EGFR, PDGFRA, and NF1 all share RAS as an interacting partner in the fourth module of the GBM results). Average pairwise distance is another commonly used metric for assessing whether groups of genes are clustered on an interaction network, and this metric might identify such cases. However, the tradeoff is that the diameter (average pairwise distance) of most biological interaction networks is small, and thus many genes are close on the network. We found that average pairwise distance was not a strict enough measure for examining mutually exclusive gene sets (data not shown). Finally, note that counting the number of interactions between genes in a PPI network is only an approximate measure of biological function. Current interaction networks have a large number of false positive interactions (particularly when interactions from high-throughput experiments are included) as well as an unknown number of false negatives. In addition, there is a problem of ascertainment bias, as cancer genes are some of the most studied human genes and many of their interactions have been assessed.

Acknowledgments

The authors thank Hsin-Ta Wu for his assistance in visualizing mutation matrices, and the anonymous reviewers for their helpful suggestions.

References

1. Gonzalez-Perez A, Lopez-Bigas N (2012) Functional impact bias reveals cancer drivers. Nucleic acids research: 1–10.

2. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, et al. (2010) A method and server for predicting damaging missense mutations. Nature methods 7: 248–249.

3. Reva B, Antipin Y, Sander C (2011) Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic acids research 39: e118.

4. Kumar P, Henikoff S, Ng PC (2009) Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature protocols 4: 1073–81.

5. Sjoblom T, Jones S, Wood LD, Parsons DW, Lin J, et al. (2006) The consensus coding sequences of human breast and colorectal cancers. Science 314: 268–274.

6. Getz G, Hofling H, Mesirov J, Golub T, Meyerson M, et al. (2007) Comment on “The consensus coding sequences of human breast and colorectal cancers”. Science 317: 1500.

7. The Cancer Genome Atlas Research Network (2012) Comprehensive molecular portraits of human breast tumours. Nature 490: 61–70.

8. The Cancer Genome Atlas Research Network (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455: 1061–8.

9. The Cancer Genome Atlas Research Network (2012) Comprehensive molecular characterization of human colon and rectal cancer. Nature 487: 330–7.

10. The Cancer Genome Atlas Research Network (2011) Integrated genomic analyses of ovarian carcinoma. Nature 474: 609–15.

11. Puente XS, Pinyol M, Quesada V, Conde L, Ordonez GR, et al. (2011) Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature 475: 101–5.

12. Stephens PJ, Tarpey PS, Davies H, Van Loo P, Greenman C, et al. (2012) The landscape of cancer genes and mutational processes in breast cancer. Nature 486: 400–4.

13. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M (2012) KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res 40: D109–114.

14. Ashburner M, Ball C, Blake J, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. Nature Genetics 25: 25–29.


15. Wendl MC, Wallis JW, Lin L, Kandoth C, Mardis ER, et al. (2011) PathScan: a tool for discerning mutational significance in groups of putative cancer genes. Bioinformatics (Oxford, England) 27: 1595–602.

16. Lin J, Gan CM, Zhang X, Jones S, Sjoblom T, et al. (2007) A multidimensional analysis of genes mutated in breast and colorectal cancers. Genome research 17: 1304–18.

17. Boca SM, Kinzler KW, Velculescu VE, Vogelstein B, Parmigiani G (2010) Patient-oriented gene set analysis for cancer mutation data. Genome biology 11: R112.

18. Cerami E, Demir E, Schultz N, Taylor BS, Sander C (2010) Automated network analysis identifies core pathways in glioblastoma. PLoS ONE 5: e8918.

19. Ciriello G, Cerami E, Sander C, Schultz N (2012) Mutual exclusivity analysis identifies oncogenic network modules. Genome Res 22: 398–406.

20. Vandin F, Upfal E, Raphael B (2011) Algorithms for detecting significantly mutated pathways in cancer. Journal of Computational Biology 18: 507–22.

21. Glaab E, Baudot A, Krasnogor N, Schneider R, Valencia A (2012) EnrichNet: network-based gene set enrichment analysis. Bioinformatics (Oxford, England) 28: i451–i457.

22. Miller C, Settle S, Sulman E, Aldape K, Milosavljevic A (2011) Discovering functional modules by identifying recurrent and mutually exclusive mutational patterns in tumors. BMC medical genomics 4: 34.

23. Vandin F, Upfal E, Raphael B (2012) De novo discovery of mutated driver pathways in cancer. Genome Res 22: 375–385.

24. Vogelstein B, Kinzler KW (2004) Cancer genes and the pathways they control. Nat Med 10: 789–799.

25. Yeang C, McCormick F, Levine A (2008) Combinatorial patterns of somatic gene mutations in cancer. The FASEB Journal 22: 2605–22.

26. Hanahan D, Weinberg R (2011) Hallmarks of cancer: the next generation. Cell 144: 646–74.

27. Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, et al. (2008) Somatic mutations affect key pathways in lung adenocarcinoma. Nature 455: 1069–1075.

28. Gerstung M, Eriksson N, Lin J, Vogelstein B, Beerenwinkel N (2011) The temporal order of genetic and pathway alterations in tumorigenesis. PLoS ONE 6: e27136.

29. Verhaak RGW, Hoadley Ka, Purdom E, Wang V, Qi Y, et al. (2010) Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer cell 17: 98–110.

30. Rea S, Xouri G, Akhtar A (2007) Males absent on the first (MOF): from flies to humans. Oncogene 26: 5385–94.

31. Buyse IM, Shao G, Huang S (1995) The retinoblastoma protein binds to RIZ, a zinc-finger protein that shares an epitope with the adenovirus E1A protein. Proceedings of the National Academy of Sciences of the United States of America 92: 4467–71.


32. Chadwick RB, Jiang GL, Bennington Ga, Yuan B, Johnson CK, et al. (2000) Candidate tumor suppressor RIZ is frequently involved in colorectal carcinogenesis. Proceedings of the National Academy of Sciences of the United States of America 97: 2662–7.

33. Ernst A, Hofmann S, Ahmadi R, Becker N, Korshunov A, et al. (2009) Genomic and expression profiling of glioblastoma stem cell-like spheroid cultures identifies novel tumor-relevant genes associated with survival. Clinical cancer research 15: 6541–50.

34. Erbel-Sieler C, Dudley C, Zhou Y, Wu X, Estill SJ, et al. (2004) Behavioral and regulatory abnormalities in mice deficient in the NPAS1 and NPAS3 transcription factors. Proceedings of the National Academy of Sciences of the United States of America 101: 13648–53.

35. Huang J, Perlis RH, Lee PH, Rush AJ, Fava M, et al. (2010) Cross-Disorder Genomewide Analysis of Schizophrenia, Bipolar Disorder, and Depression. American Journal of Psychiatry 167: 1254–1263.

36. Moreira F, Kiehl TR, So K, Ajeawung NF, Honculada C, et al. (2011) NPAS3 demonstrates features of a tumor suppressive role in driving the progression of Astrocytomas. The American journal of pathology 179: 462–76.

37. van der Groep P, van der Wall E, van Diest PJ (2011) Pathology of hereditary breast cancer. Cellular oncology (Dordrecht) 34: 71–88.

38. Cowin P, Rowlands TM, Hatsell SJ (2005) Cadherins and catenins in breast cancer. Current opinion in cell biology 17: 499–508.

39. Green AR, Krivinskas S, Young P, Rakha Ea, Paish EC, et al. (2009) Loss of expression of chromosome 16q genes DPEP1 and CTCF in lobular carcinoma in situ of the breast. Breast cancer research and treatment 113: 59–66.

40. Usary J, Llaca V, Karaca G, Presswala S, Karaca M, et al. (2004) Mutation of GATA3 in human breast tumors. Oncogene 23: 7669–78.

41. Yan W, Cao QJ, Arenas RB, Bentley B, Shao R (2010) GATA3 inhibits breast cancer metastasis through the reversal of epithelial-mesenchymal transition. The Journal of biological chemistry 285: 14042–51.

42. Kutuzov Ma, Bennett N, Andreeva AV (2010) Protein phosphatase with EF-hand domains 2 (PPEF2) is a potent negative regulator of apoptosis signal regulating kinase-1 (ASK1). The international journal of biochemistry & cell biology 42: 1816–22.

43. Medina PP, Romero Oa, Kohno T, Montuenga LM, Pio R, et al. (2008) Frequent BRG1/SMARCA4-inactivating mutations in human lung cancer cell lines. Human mutation 29: 617–22.

44. Kaye FJ (2002) RB and cyclin dependent kinase pathways: defining a distinction between RB and p16 loss in lung cancer. Oncogene 21: 6908–14.

45. Dees ND, Zhang Q, Kandoth C, Wendl MC, Schierding W, et al. (2012) MuSiC: identifying mutational significance in cancer genomes. Genome research 22: 1589–98.

46. Greenman C, Wooster R, Futreal PA, Stratton MR, Easton DF (2006) Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics 173: 2187–98.


47. Mermel C, Schumacher S, Hill B, Meyerson M, Beroukhim R, et al. (2011) GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biology 12: R41.

48. Mitsui J, Takahashi Y, Goto J, Tomiyama H, Ishikawa S, et al. (2010) Mechanisms of genomic instabilities underlying two common fragile-site-associated loci, PARK2 and DMD, in germ cell and cancer cell lines. American journal of human genetics 87: 75–89.

49. Bednarek AK, Laflin KJ, Daniel RL, Liao Q, Hawkins KA, et al. (2000) WWOX, a Novel WW Domain-containing Protein Mapping to Affected in Breast Cancer Advances in Brief. Cancer research 60: 2140–2145.

50. Milo R, Kashtan N, Itzkovitz S, Newman MEJ, Alon U (2003) On the uniform generation of random graphs with prescribed degree sequences. eprint arXiv:cond-mat/0312028.

51. Razick S, Magklaras G, Donaldson I (2008) iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics 9: 405.


Figures

[Figure 1 diagram. Stages: 0. Data preprocessing (CNV filtering of GISTIC wide peaks: consistent amp/del, remove artifacts; SNV filtering of nsSNVs and indels: mutation frequency, expected no. of SNVs). 1. Input mutation matrix A (m patients × n genes and mutation classes), with parameters no. gene sets, min. gene set size, and max. gene set size. 2. Analysis and annotation (Multi-Dendrix rapidly identifies sets of driver pathways; permutation test; compute enrichment stats using iRefIndex; subtype-specific mutations). 3. Output (stability measures, consensus modules, Multi-Dendrix web site).]

Figure 1: The Multi-Dendrix pipeline. Multi-Dendrix analyzes integrated mutation data from a variety of sources including single-nucleotide mutations and copy number aberrations. Multiple gene sets are identified using a combinatorial optimization approach. The output is analyzed for subtype-specific mutations and summarized across multiple values of the parameters: t, number of gene sets, and kmax, maximum size per gene set.


[Figure 2 graphic. Panels a–d show the four GBM modules, with the number of mutated samples per mutation class in parentheses: (a) CDKN2A(D) (176), CDK4(A) (53), RB1 (19), MSL3 (5); (b) TP53 (76), MDM4(A) (41), MDM2(A) (23), NPAS3(D) (21), NLRP3 (6); (c) PTEN (76), PTEN(D) (41), PIK3CA (17), PIK3R1 (17), IDH1 (14), PRDM2(A) (13); (d) EGFR (52), PDGFRA(A) (46), RB1(D) (35). Coverage: 88% (229/261) of samples. Legends: network diagram (gene, subtype-associated gene, stability measure; node colors for p53 signaling, RTK/RAS/PI(3)K signaling, RB signaling, other), protein interactions (inhibits, activates, interacts), and mutation matrix (amplification, deletion, SNV; exclusive and co-occurring mutations).]

Figure 2: Multi-Dendrix results on the GBM dataset. (Left) Nodes represent genes in four modules found by Multi-Dendrix using t = 2, . . . , 4 gene sets of minimum size kmin = 3 and maximum size kmax = 3, . . . , 5. Genes with “(A)” appended are amplification events, genes with “(D)” appended are deletion events, and genes with no annotation are SNVs. Edges connect genes that appear in the same gene set for more than one value of the parameters, with labels indicating the fraction of parameter values for which the pair of genes appear in the same gene set. Color of nodes indicates membership in three signaling pathways noted in [7] as important for GBM: RB, p53, and RTK/RAS/PI(3)K signaling. Shape of nodes indicates genes whose mutations are associated with specific GBM subtypes, and dashed edges connect genes associated with different subtypes. The direct interactions statistic ν of this collection of gene sets is significant (P = 0.002). (Middle) Known interactions between proteins in each set and p-value for observed number of interactions. (Right) Mutation matrix for each of four modules with mutually exclusive (blue) and co-occurring mutations (orange).


[Figure 3 graphic. Panels a–d show the four BRCA modules, with the number of mutated samples per mutation class in parentheses: (a) PIK3CA (178), 12p13.33(A) (82), PTEN(D) (71), PIK3R1 (14), AKT1 (12), HIF3A (5); (b) TP53 (186), GATA3 (54), CDH1 (33), CTCF (13), GPRIN2 (5); (c) MAP2K4(D) (172), MAP3K1 (39), PPEF1 (7), SMARCA4 (6), WWP2 (6); (d) CCND1(A) (151), MAP2K4 (21), RB1 (9), GRID1 (7). Legends: network diagram (node colors for p53 signaling, PI(3)K/AKT signaling, cell cycle checkpoints, p38-JNK1, other) and subtype annotations (Luminal A, Luminal B, Basal-like, HER2-enriched, Normal-like).]

Figure 3: Multi-Dendrix results on the BRCA dataset. Graphical elements are as described in the Figure 2 caption, except for the following. Color of nodes indicates membership in four signaling pathways noted in [8] as important for BRCA: p53 signaling, PI(3)K/AKT signaling, cell cycle checkpoints, and p38-JNK1. The top row of each mutation matrix annotates the subtype of each patient. The regulatory interaction between GATA3 and CDH1 is shown as a dashed line. The direct interactions statistic ν of this collection of gene sets is significant (P < 0.001).


Tables

Avg. distance d from planted pathways

q        Multi-Dendrix   Iter-RME      Iter-Dendrix
0.0      0.02 ± 0.19     0.01 ± 0.12   0.30 ± 0.86
0.0001   0.02 ± 0.18     0.01 ± 0.16   0.30 ± 0.86
0.0005   0.04 ± 0.23     0.10 ± 0.40   0.35 ± 0.89
0.001    0.10 ± 0.35     0.32 ± 0.60   0.44 ± 1.01
0.005    0.44 ± 0.71     –             0.75 ± 1.07
0.01     1.03 ± 1.00     –             1.20 ± 1.15
0.015    1.68 ± 1.16     –             1.78 ± 1.26
0.02     2.17 ± 1.24     –             2.21 ± 1.29

Table 1: A comparison of the algorithms on simulated mutation data with varying passenger mutation probability q. Italicized rows correspond to values of q approximated from real cancer datasets. Each entry is mean (µ) and standard deviation (σ) (across 1000 simulations) of the distance d(P,M) between the planted set of pathways P and the collections M found by each algorithm. The minimum distance d = 0 indicates an algorithm found the planted pathways exactly, while the maximum distance d = 16 indicates that an algorithm did not find any of the genes in the planted pathways. Bold text indicates the top performing algorithm for each value of q. Multi-Dendrix is the top performer for all values of q except the smallest q = 0.0001. The differences between Multi-Dendrix and both Iter-Dendrix and Iter-RME are statistically significant (p < 0.01) for 0.0005 ≤ q ≤ 0.01. For q > 0.001, Iter-RME did not complete after 24 hours of runtime.


Avg. runtime (secs)

q        Multi-Dendrix   Iter-RME    Iter-Dendrix
0.0001   0.28            19.22       609.79
0.0005   0.28            17.36       621.22
0.001    0.50            123.16      610.21
0.005    1.46            > 86,400    672.43
0.01     2.60            > 86,400    711.55
0.015    4.06            > 86,400    730.85
0.02     4.93            > 86,400    727.82

Table 2: A comparison of the runtimes of Multi-Dendrix, Iter-RME, and Iter-Dendrix on simulated mutation data with varying passenger mutation probability q. Runtimes for each algorithm are reported as the mean runtime of 10 runs for each value of q. Note that Multi-Dendrix has runtimes under 5 seconds for all datasets, and is orders of magnitude faster than the other methods for each q. For Iter-RME, we report a runtime of >86,400 seconds for q ≥ 0.005 as Iter-RME did not complete within one day. Simulations were performed on machines running 64-bit Debian Linux with Xeon 2.8GHz processors and a maximum of 8GB of available memory.


Supporting Information

GBM(2008) and Lung datasets

Somatic mutation processing in targeted gene sequencing studies.

For the GBM(2008) and lung adenocarcinoma datasets, we used all genes with non-synonymous single-nucleotide mutations or small indels in at least two patients. For GBM(2008), we also included copy number variants (CNVs), adding a CNV mutation type for a gene if the gene had a CNV of the same type (amplification or deletion) in at least 90% of the patients with a CNV. If a set of genes was mutated in the same patients, we merged these genes into a single metagene. We manually merged CYP27B1 into the metagene containing CDK4, as described in [23].
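The recurrence filter and metagene merging described above can be sketched as follows. This is a hedged illustration, not the processing code used for these datasets: the function name, the `min_patients` parameter, the "/"-joined metagene label, and the gene and patient identifiers are all hypothetical.

```python
from collections import defaultdict


def build_metagenes(gene_to_patients, min_patients=2):
    """Keep genes mutated in >= min_patients patients, then merge genes
    that are mutated in exactly the same set of patients into a metagene."""
    groups = defaultdict(list)
    for gene, patients in gene_to_patients.items():
        if len(patients) >= min_patients:
            groups[frozenset(patients)].append(gene)
    # The "/"-joined label for a metagene is an illustrative convention.
    return {"/".join(sorted(genes)): set(patients)
            for patients, genes in groups.items()}


# Hypothetical input: GENE_A and GENE_B are mutated in the same patients.
toy_mutations = {
    "GENE_A": {"p1", "p2"},
    "GENE_B": {"p1", "p2"},
    "GENE_C": {"p1", "p3", "p4"},
    "GENE_D": {"p5"},            # mutated in one patient only: dropped
}
toy_metagenes = build_metagenes(toy_mutations)
```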

Multi-Dendrix results

We ran Multi-Dendrix and Iter-Dendrix on the GBM(2008) and Lung datasets using the same parameters as in our analysis of the GBM and BRCA datasets: minimum gene set size kmin = 3, maximum gene set size kmax = 3 . . . 5, and number of gene sets t = 2 . . . 4, for a total of 9 parameter combinations. We analyze Multi-Dendrix’s results below, and present a comparison of Multi-Dendrix’s and Iter-Dendrix’s results in § Comparison of Multi-Dendrix and Iter-Dendrix results.
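As a rough sketch of how results can be summarized over the nine parameter combinations, the following Python enumerates the (t, kmax) grid and computes, for each pair of genes, the fraction of parameter choices in which the pair falls in the same gene set. The function names and the `results` mapping (from a parameter pair to the collection returned for it) are hypothetical, and gene sets within a collection are assumed to be disjoint.

```python
from collections import Counter
from itertools import combinations, product


def parameter_grid(t_values=range(2, 5), k_max_values=range(3, 6)):
    """All (t, k_max) combinations: t = 2..4 and k_max = 3..5."""
    return list(product(t_values, k_max_values))


def pair_stability(results):
    """Fraction of parameter choices in which each gene pair shares a set."""
    counts = Counter()
    for collection in results.values():
        for gene_set in collection:
            for pair in combinations(sorted(gene_set), 2):
                counts[pair] += 1
    return {pair: n / len(results) for pair, n in counts.items()}


grid = parameter_grid()

# Hypothetical results: gene "D" joins the set only when k_max >= 4.
toy_results = {
    (t, k_max): [{"A", "B", "C"} | ({"D"} if k_max >= 4 else set())]
    for t, k_max in grid
}
stability = pair_stability(toy_results)
```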

GBM(2008). Figure S6 shows Multi-Dendrix’s results on the GBM(2008) dataset. Multi-Dendrix finds four modules that contain known cancer genes. In addition, eight genes appear in the results for one or two parameter choices, but with different groups of genes. Two modules contain triples that are consistent across all parameter choices: {CDKN2B, RB1, CDK4} and {CDKN2A, TP53, DTX3}. Note that due to different copy number processing, the genes CDKN2A and CDKN2B appear as different mutation classes in the GBM(2008) data, while they were merged in a metagene in the GBM data presented above. Beyond this difference, the triple {CDKN2B, CDK4, RB1} is the same as the triple in the first module found by Multi-Dendrix on the GBM dataset, and contains two known interactions (for further analysis of this module, see § Mutually exclusive sets in Glioblastoma (GBM)). For six of the nine parameter choices, this module also contains the well-known cancer gene ERBB2, a member of the RTK pathway important for glioblastoma [7].

The second triple, {CDKN2A, TP53, DTX3}, contains two frequently mutated GBM cancer genes that interact: TP53 and CDKN2A. For six of the nine parameter choices, this module also contains CDC123, although CDC123 is mutated in only two samples. The third module contains both EGFR and NF1, two genes identified by [29] to be associated with the Classical and Mesenchymal subtypes, respectively. These two genes account for 52 of the 70 mutated samples in this module, and thus the exclusivity of mutations in this module is likely due to subtype-specific mutations. Interestingly, the two genes associated with the Proneural subtype, IDH1 and PDGFRA, are not found in any of the results (IDH1 was not included in the list of targeted genes for the GBM(2008) dataset). The weights W′ of all collections identified by Multi-Dendrix are significant (p < 0.0001).

Lung. Figure S7 shows Multi-Dendrix’s results on the Lung dataset. In contrast to the other datasets, the output of Multi-Dendrix varies widely with the choice of parameters (e.g. STK11 has 8 edges with weight ≥ 0.2 in the summarization graph). There are two large modules, one of size four and one of size thirteen, as well as 10 additional genes that appear with another gene for only one choice of parameters (and thus are nodes of degree 1 in the summarization graph).

The module of size four includes the triple TP53, ATM, and PAK4 found for all values of parameters. This triple is nearly perfectly exclusive; it covers 79 patients (i.e. Γ(M) = 79) with a coverage overlap ω(M) = 1. TP53 and ATM are both tumor suppressors that interact in response to DNA damage, and have been shown previously to be recurrently mutated in lung adenocarcinoma [52]. PAK4 is involved in both cellular survival and angiogenesis, and has been implicated in tumor progression in a variety of cancers [53], though we did not find reports of its role in lung adenocarcinoma. While PAK4 is mutated in only three patients in the Lung dataset, it is still significantly exclusive with ATM (p = 0.0008 by Fisher’s exact test), and may deserve further inquiry.
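The coverage and coverage overlap quoted above can be computed with a short Python sketch. It assumes the Dendrix-style definitions (coverage is the number of patients with a mutation in at least one gene of the set; coverage overlap is the sum of the per-gene patient counts minus the coverage), which are given in the Methods rather than in this section. The function names and the toy data are hypothetical.

```python
def coverage(gene_set, gene_to_patients):
    """Patients with a mutation in at least one gene of the set."""
    covered = set()
    for gene in gene_set:
        covered |= gene_to_patients[gene]
    return covered


def coverage_overlap(gene_set, gene_to_patients):
    """Sum of per-gene patient counts minus the size of the coverage.

    Zero means the mutations in the set are perfectly mutually exclusive.
    """
    total = sum(len(gene_to_patients[gene]) for gene in gene_set)
    return total - len(coverage(gene_set, gene_to_patients))


# Hypothetical data: patient p5 carries mutations in both GENE_B and GENE_C.
toy_patients = {
    "GENE_A": {"p1", "p2", "p3"},
    "GENE_B": {"p4", "p5"},
    "GENE_C": {"p5", "p6"},
}
toy_set = ["GENE_A", "GENE_B", "GENE_C"]
toy_coverage = len(coverage(toy_set, toy_patients))       # 6 patients
toy_overlap = coverage_overlap(toy_set, toy_patients)     # 7 - 6 = 1
```

Under these definitions, a triple that covers 79 patients with overlap 1 has exactly one extra mutation beyond a perfectly exclusive pattern.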

The module of size 13 includes two gene triples that are grouped together for most parameter choices. The triple {EGFR, KRAS, EPHB1} contains the well-known interaction between EGFR and KRAS, two proteins whose mutations are important for lung adenocarcinoma [27]. Similar to EGFR, EPHB1 is a receptor tyrosine kinase, and mutations in this gene (as in mutations in EGFR) were reported to be associated with higher copy number and mRNA expression levels in [27]. The triple {STK11, NF1, NTRK1} contains two well-known cancer genes: STK11 and NF1. Mutations in STK11 have long been a hallmark of lung adenocarcinoma, but the discovery of mutations in the tumor suppressor NF1 was more recent [27]. While the genes in this triple have no known interactions, the triple is nearly perfectly exclusive (coverage Γ(M) = 54; coverage overlap ω(M) = 1), which indicates that lung adenocarcinoma patients need inactivating mutations in just one of these genes. Thus, they may have an uncharacterized relationship in lung adenocarcinoma.

While Multi-Dendrix’s results on the Lung dataset are promising, we did not conduct further analysis or comparison of these results because of the inconsistency across parameter choices. We note that a likely explanation for this inconsistency is that the processed data includes a relatively small number of samples (163) and only includes mutations from targeted sequencing of 190 genes, missing other mutations and copy number aberrations in these samples.

Comparison of Multi-Dendrix and Iter-Dendrix results

We compare the gene sets found by Multi-Dendrix and those found by Iter-Dendrix on the GBM(2008), GBM, and BRCA datasets, computing the overlap between the results of the algorithms and comparing the number of known protein-protein interactions in the results.

Overlap

We compared the distances between collections found by Multi-Dendrix and Iter-Dendrix, using the symmetric difference function d(M, I) defined in § Simulated data. Table S4 reports these distances. There are many differences between the two algorithms on the GBM(2008) and BRCA datasets. On the GBM dataset, there is only one difference: Iter-Dendrix includes the gene IRF5 in the gene set that includes RB1, CDK4(A), and CDKN2A/CDKN2B(D) for kmax = 5.

Interactions

Tables S2 and S3 show the results of the direct interactions test on collections found by Multi-Dendrix and Iter-Dendrix (for details of the test, see § Evaluating known interactions). In total, across all three datasets, nearly all collections found by Multi-Dendrix are significantly enriched for interactions (24/27 collections, p < 0.05), slightly more than are found by Iter-Dendrix (20/27 collections, p < 0.05). The largest difference is on the BRCA dataset, where all collections found by Multi-Dendrix have lower p-values than Iter-Dendrix. In addition, Multi-Dendrix finds all nine collections significantly enriched for interactions (p ≤ 0.05), while Iter-Dendrix finds just two. This difference is amplified by the fact that Iter-Dendrix finds five collections with p ≥ 0.15 (by comparison, the largest p-value for Multi-Dendrix on BRCA is 0.01). Thus, overall Multi-Dendrix finds collections of gene sets that are more enriched for interactions.

We do not compare the enrichment for interactions of individual gene sets found by Multi-Dendrix and Iter-Dendrix. Iter-Dendrix is a greedy algorithm and is thus guaranteed to find the same gene sets multiple times as we vary t. Thus, it is more reasonable to evaluate the stable modules for enrichment for interactions, as presented in § Mutually exclusive sets in Glioblastoma (GBM) and § Mutually exclusive sets in Breast Cancer (BRCA).

Comparison of Multi-Dendrix and RME

We compared Multi-Dendrix and Iter-RME on the GBM(2008), GBM, and BRCA datasets. We ran RME with default parameter values, including the default minimum mutation frequency of ≥ 10%. After removing genes mutated in fewer than 10% of samples, the GBM(2008), GBM, and BRCA datasets contain 18, 10, and 28 genes, respectively. As a result, using the same parameters as above, RME can find at most three gene sets of the minimum size 3 in the GBM dataset, since there are only 10 genes with mutation frequencies ≥ 10% in the GBM dataset. We ran RME for up to t = 4 gene sets of size kmax ∈ [2, 5] so that RME could find t = 4 gene sets in the GBM and BRCA data. Note that unlike Dendrix, RME has a parameter for setting the maximum gene set size, equivalent to kmax, and returns gene sets of size [2, kmax]. Thus, at each iteration, we chose the gene set P, 2 ≤ |P| ≤ kmax, such that P had the largest RME algorithmic significance score.
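The effect of the 10% frequency filter on the number of gene sets that can be found is simple arithmetic, sketched below in Python. This is not RME's code: the function names and toy data are hypothetical, and the bound assumes the gene sets in a collection are disjoint.

```python
def filter_by_frequency(gene_to_patients, n_samples, min_frequency=0.10):
    """Keep genes mutated in at least min_frequency of the samples."""
    return {gene: patients for gene, patients in gene_to_patients.items()
            if len(patients) / n_samples >= min_frequency}


def max_disjoint_sets(n_genes, k_min):
    """Upper bound on the number of disjoint gene sets of size >= k_min."""
    return n_genes // k_min


# With 10 genes left after filtering and k_min = 3, at most 3 sets fit.
bound_gbm = max_disjoint_sets(10, 3)

# Hypothetical mutation data for 20 samples.
toy_data = {"GENE_A": {"s1"}, "GENE_B": {"s1", "s2", "s3"}}
kept = filter_by_frequency(toy_data, n_samples=20)
```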

For all values of the parameters t and kmax, Iter-RME finds collections where each gene set has size 2. This is due to a combination of RME’s algorithmic significance score, which is highest for gene sets of size 2, and the fact that there are only 10-28 genes in each dataset. It is difficult to compare Multi-Dendrix and Iter-RME using the tests presented in Section , as Iter-RME finds only 4 gene sets of size two in the GBM(2008), GBM, and BRCA datasets. It is unclear how interesting these pairs are: only three pairs of genes from all datasets interact in either the iRefIndex or KEGG protein-protein interaction networks, despite these pairs being chosen from the most highly mutated genes in each dataset (Table S7). Thus, the runtime requirements of RME limit its use to a small subset of highly mutated genes, and preclude RME from finding interesting gene sets of size greater than 2 on the tested datasets. In contrast, with the same parameters, Multi-Dendrix finds larger gene sets of up to 5 genes, many of which show enrichment for interactions (see Results).

Subtype analysis

GBM subtypes

We annotate the Multi-Dendrix results for associations with known subtypes. [29] derived four subtypes of GBM from gene expression data, and these classifications have become the standard in TCGA analysis. However, we were unable to directly use the sample annotations from [29] for the GBM mutation dataset since there are only 39 samples in common between the two datasets. Instead, we identify whether the gene sets produced by Multi-Dendrix contain any of the subtype-specific genes (IDH1, PDGFRA, EGFR, and NF1) reported in [29]. Figure 2 shows that one of the four modules found by Multi-Dendrix contains three of these four genes (all except IDH1). These genes are all members of the RTK/RAS/PI(3)K signaling pathway as annotated in [7], although they do not interact in iRefIndex or KEGG.

We also considered associations between subtypes defined by other automated clustering of gene expression data from TCGA. We downloaded GBM subtypes generated from consensus hierarchical clusters of mRNA expression data from the Broad’s Firehose website (http://gdac.broadinstitute.org/runs/analyses__latest/reports/cancer/GBM/index.html). This dataset contains subtype information for 529 samples, 224 of which appear in our GBM mutation dataset, and maps patients to one of three subtypes (“1”, “2”, or “3”). Note that [29] identified four subtypes, and thus there is no one-to-one correspondence between the two subtype classifications. We determined if mutations in a gene were associated with subtype using Fisher’s exact test. Table S5 lists all significant associations (P < 0.01 following Bonferroni multiple hypothesis correction). Notably, Consensus 1 and Consensus 2 have no significant associations, while the Consensus 3 subtype has all eight of the significant associations. These include the markers IDH1 and PDGFRA in the Proneural subtype of [29]. The other two subtype markers found by [29], EGFR and NF1, which characterize the Classical and Mesenchymal subtypes, do not appear associated with any of the Consensus clusters. Because the overlap between these clusters and those of [29] was imperfect, we used the genes that [29] reported to have subtype-specific mutations.
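The association test described above (Fisher’s exact test on a 2 × 2 table of mutation status against subtype membership, followed by a Bonferroni correction) can be sketched with the Python standard library. This is an illustrative sketch rather than the analysis code: the function names are hypothetical, the two-sided p-value sums the probabilities of all tables no more likely than the observed one, and the number of tests used in the correction is left as a parameter because it is not stated in this section.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p = sum(prob(x) for x in range(lo, hi + 1)
            if prob(x) <= p_observed * (1 + 1e-7))
    return min(1.0, p)


def association_table(mutated_samples, subtype_of, subtype):
    """Counts: (mutated in subtype, mutated elsewhere,
    unmutated in subtype, unmutated elsewhere)."""
    a = b = c = d = 0
    for sample, label in subtype_of.items():
        mutated, inside = sample in mutated_samples, label == subtype
        if mutated and inside:
            a += 1
        elif mutated:
            b += 1
        elif inside:
            c += 1
        else:
            d += 1
    return a, b, c, d


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical data: eight samples, four in subtype "3".
toy_subtypes = {f"s{i}": ("3" if i <= 4 else "1") for i in range(1, 9)}
toy_mutated = {"s1", "s2", "s3", "s5"}
toy_table = association_table(toy_mutated, toy_subtypes, "3")
toy_p = fisher_exact_two_sided(*toy_table)
```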

BRCA subtypes

We ran Multi-Dendrix on BRCA mutation data restricted to patients from each of the subtypes detailed in [8]. Because the patients are not distributed evenly across the subtypes, we varied Multi-Dendrix’s exclusivity parameter α for each subtype (Luminal A: α = 2; Luminal B: α = 1.5; HER2-enriched: α = 1; Basal-like: α = 1). We discuss these results in § Mutually exclusive sets in Breast Cancer (BRCA). In addition, we report all significant associations (significance level: p < 0.01) between genes or mutation classes and BRCA subtypes in Table S6.

Multi-Dendrix results without SNV filtering

We tested Multi-Dendrix on the GBM and BRCA mutation matrices constructed without any filtering of SNV data. Thus, we included all 10,987 (GBM) and 12,248 (BRCA) genes with at least one non-synonymous SNV in any sample. We also included the copy number aberrations output from GISTIC as described in § Construction of mutation matrices. The resulting mutation matrices included 11,023 mutation classes in 261 GBM samples and 12,281 mutation classes in 507 BRCA samples. We report the stable modules identified by Multi-Dendrix here. The modules identified on the mutation data without SNV processing have symmetric distance ∆ of 7 (GBM) and 36 (BRCA) from the modules identified from the mutation data with SNV processing. In the GBM data, the differences are minor: for all four modules reported in the main text, at least four genes are in common with the corresponding module found with the unfiltered mutation data. In particular, the Multi-Dendrix results with the unfiltered data contain the following changes:

• LMNB2 and SUSD3 appear in the module containing CDKN2A(D), CDK4(A), RB1, and MSL3.

• KRTAP5-5 replaces NLRP3 in the module containing TP53, MDM4(A), MDM2(A), and NPAS3(D).

• HMCN1 replaces PIK3R1 for two parameter choices in the module containing PIK3CA, PTEN(D), PTEN, IDH1, and PRDM2(A).

• MDN1 appears in the module containing EGFR, PDGFRA(A), and RB1(D).

On the BRCA data, the differences are larger: for three of the four modules reported in the main text, only two of the genes in those modules are placed in the same module when Multi-Dendrix is run on the unfiltered mutation data. However, the fourth module (TP53, GATA3, CDH1, CTCF, and GPRIN2) remains exactly the same. In particular, the Multi-Dendrix results with the unfiltered data contain the following changes:

• HUWE1, SVIL, and LPHN3 replace AKT1, PIK3R1, 12p13.33(A), and HIF3A in the module containing PTEN(D) and PIK3CA.

• LAMA2 and USP34 replace MAP2K4 and GRID1 in the module containing CCND1(A) and RB1.

• TLN1, MYH7, DNAH1, and NOTCH4 replace SMARCA4, PPEF1, and WWP2 in the module containing MAP2K4(D) and MAP3K1.

Many of the genes that appear only in the modules found with unfiltered data are long genes or genes with high background mutation rates (BMRs), which are more likely than other genes to be mutated in many samples. The fifteen genes that appear only in modules on the unfiltered GBM and BRCA data have an average length of greater than 8500 and an average BMR of greater than 5 × 10⁻⁶ (BMRs were calculated from SNV data from twelve cancers). Such outlier genes are a potential source of false positives, not only for Multi-Dendrix, but also for any method that attempts to identify cancer genes by their recurrence across samples [6]. Multi-Dendrix identified the stable modules on the unfiltered GBM and BRCA data in 6,863 and 14,491 seconds, respectively.

References

52. Greulich H (2010) The genomics of lung adenocarcinoma: opportunities for targeted therapies. Genes & Cancer 1: 1200-10.

53. Kesanakurti D, Chetty C, Rajasekhar Maddirela D, Gujrati M, Rao JS (2012) Functional cooperativity by direct interaction between PAK4 and MMP-2 in the regulation of anoikis resistance, migration and invasion in glioma. Cell Death & Disease 3: e445.


Supporting Tables

m             100     200     400     800     1600    3200    6400    12800   22000
q = 0.02      51.8    93.4    185.4   357.8   722.7   1409.3  2826.6  5663.4  9698.4
q = 0.0001    24.2    34.9    54      88.1    169.1   312.9   617.2   1236    2108.4

Table S1: Average number of genes input to Multi-Dendrix across 100 simulated datasets. We use a minimum threshold of nq, the expected number of the n total samples with a mutation. We report the values for different numbers m of genes and for the lowest (q = 0.0001) and highest (q = 0.02) passenger mutation probabilities. On all simulated datasets, Multi-Dendrix’s average runtime is < 1 hour.
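The thresholding step behind Table S1 can be illustrated with a short simulation. This is only a sketch under stated assumptions: each gene is mutated independently in each sample with probability q (passenger mutations only, with no planted driver pathways), n = 1000 follows the patient count given for Figure S1, and the threshold is applied as "at least nq". The function names are hypothetical.

```python
import random


def simulate_passenger_counts(m, n, q, rng):
    """For each of m genes, count how many of n samples carry a passenger mutation."""
    return [sum(rng.random() < q for _ in range(n)) for _ in range(m)]


def filter_genes(counts, n, q):
    """Keep the indices of genes mutated in at least n*q samples (the expected count)."""
    threshold = n * q
    return [g for g, c in enumerate(counts) if c >= threshold]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n, m, q = 1000, 100, 0.02
counts = simulate_passenger_counts(m, n, q, rng)
kept = filter_genes(counts, n, q)
print(len(kept), "of", m, "genes pass the threshold of", n * q, "samples")
```

Because the threshold equals the expected mutation count of a passenger gene, a sizeable fraction of genes passes it at any q, which is why the input size in Table S1 grows roughly linearly with m.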


            GBM(2008)                    GBM                          BRCA
kmax        3        4        5          3        4        5          3        4        5
t = 2       0.005    0.026    0.027      < 0.001  < 0.001  < 0.001    0.003    0.007    0.006
t = 3       < 0.001  0.021    0.375      < 0.001  0.001    0.001      < 0.001  0.003    0.007
t = 4       0.253    0.036    0.314      0.013    0.006    0.002      < 0.001  0.001    0.01

Table S2: p-values for the number of observed protein-protein interactions in Multi-Dendrix results (direct interactions test) for different values of parameters t, the number of gene sets, and kmax, the maximum gene set size. The minimum gene set size is kmin = 3 for all runs. The p-values were calculated from 1000 permuted networks constructed from the union of the KEGG and iRefIndex PPI networks.
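An empirical p-value of the kind reported in Table S2 can be sketched as follows. The source states only that 1000 permuted networks were used; the null model below (shuffling node labels while keeping the topology) is a simple stand-in chosen for illustration and is an assumption, as are the function names and the toy network.

```python
import random


def count_interactions(gene_sets, edges):
    """Count gene pairs within the same gene set that are joined by a network edge."""
    edge_set = {frozenset(e) for e in edges}
    total = 0
    for genes in gene_sets:
        ordered = sorted(genes)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                if frozenset((ordered[i], ordered[j])) in edge_set:
                    total += 1
    return total


def permute_labels(edges, rng):
    """Hypothetical null model: shuffle node labels, keeping the network topology."""
    nodes = sorted({v for e in edges for v in e})
    shuffled = nodes[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(nodes, shuffled))
    return [(relabel[u], relabel[v]) for u, v in edges]


def empirical_p_value(gene_sets, edges, n_perm=1000, seed=0):
    """Fraction of permuted networks with at least as many within-set interactions."""
    rng = random.Random(seed)
    observed = count_interactions(gene_sets, edges)
    hits = sum(
        count_interactions(gene_sets, permute_labels(edges, rng)) >= observed
        for _ in range(n_perm)
    )
    return observed, hits / n_perm


# Toy network and gene sets with placeholder names (not real interaction data).
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("G", "H")]
toy_sets = [{"A", "B", "C"}, {"E", "F", "G"}]
observed, p_value = empirical_p_value(toy_sets, toy_edges)
```

With 1000 permutations the smallest nonzero estimate is 0.001, so a result with no permuted network reaching the observed count would be reported as "< 0.001", matching the form of the entries in the table.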

            GBM(2008)                    GBM                          BRCA
kmax        3        4        5          3        4        5          3        4        5
t = 2       0.005    0.026    0.032      < 0.001  < 0.001  < 0.001    0.166    0.17     0.184
t = 3       < 0.001  0.021    0.046      < 0.001  0.001    0.001      0.05     0.26     0.72
t = 4       0.001    0.002    0.043      0.013    0.006    0.002      0.003    0.005    0.103

Table S3: p-values for the number of observed protein-protein interactions in Iter-Dendrix results (direct interactions test) for different values of parameters t, the number of gene sets, and kmax, the maximum gene set size. The minimum gene set size is kmin = 3 for all runs. The p-values were calculated from 1000 permuted networks constructed from the union of the KEGG and iRefIndex PPI networks.

            GBM(2008)          GBM                BRCA
kmax        3     4     5      3     4     5      3     4     5
t = 2       0     0     3      0     0     1      6     8     10
t = 3       6     6     13     0     0     1      9     12    15
t = 4       4     10    15     0     0     1      12    16    20

Table S4: The size of the symmetric difference d(M, I) for collections M, I found by Multi-Dendrix and Iter-Dendrix, respectively.

Gene          Subtype        p-value
IDH1          Consensus 3    1.24E-05
ARID2(D)      Consensus 3    3.66E-05
PDGFRA(A)     Consensus 3    3.79E-05
CDK4(A)       Consensus 3    0.00098192
KIF4B         Consensus 3    0.001198772
BBS1          Consensus 3    0.006889494
TP53          Consensus 3    0.008230181
MET(A)        Consensus 3    0.009878469

Table S5: Significant associations (p < 0.01) between mutations in genes (SNVs, amplifications “(A)”, or deletions “(D)”) and three subtypes from GBM consensus clusters. p-values were calculated using Fisher’s exact test with a Bonferroni correction for multiple hypotheses.


Gene           Subtype          p-value
TP53           Basal-like       1.53E-18
12p13.33(A)    Basal-like       2.17E-10
PTEN(D)        Basal-like       3.16E-06
MAP3K1(D)      Basal-like       7.93E-06
5q21.3(D)      Basal-like       0.000109176
PIK3CA(A)      Basal-like       0.000417525
RB1(D)         Basal-like       0.001671689
11p13(A)       Basal-like       0.002899894
FGF2(A)        Basal-like       0.003068744
EPS8L1         Basal-like       0.00649918
MYC(A)         Basal-like       0.007955101
IGF1R(A)       Basal-like       0.008512299
ERBB2(A)       HER2-enriched    5.07E-26
TP53           HER2-enriched    8.64E-09
MIR21(A)       HER2-enriched    4.24E-08
6q21(A)        HER2-enriched    0.001672642
4q13.3(A)      HER2-enriched    0.002513158
SRPR           HER2-enriched    0.003786332
ATP1A4         HER2-enriched    0.007586379
ERBB3          HER2-enriched    0.007586379
PIK3CA         Luminal A        7.39E-06
MAP3K1         Luminal A        9.94E-05
CCND1(A)       Luminal B        1.36E-08
KCNB2          Luminal B        0.001123083
3p25.1(A)      Luminal B        0.001333521

Table S6: Significant associations (p < 0.01) between mutations (SNVs, amplifications “(A)”, or deletions “(D)”) and the four BRCA subtypes. p-values were calculated using Fisher’s exact test with a Bonferroni correction for multiple hypotheses.


Dataset      Iteration    Genes                          Interact
GBM(2008)    1            RB1, CDKN2B                    No
             2            NF1, EGFR                      No
             3            OS9, CDKN2A                    No
             4            TP53, MDM2                     Yes
GBM          1            CDK4(A), CDKN2A CDKN2B(D)      Yes
             2            PTEN, PTEN(D)                  N/A
             3            MDM4(A), TP53                  Yes
             4            EGFR, PDGFRA(A)                No
BRCA         1            TP53, GATA3                    No
             2            PIK3CA, PTEN(D)                Yes
             3            MAP2K4(D), FOXA1(A)            No
             4            8p11.23(A), ERBB2(D)           No

Table S7: Gene sets found by Iter-RME in the GBM(2008), GBM, and BRCA datasets (after removing genes with mutation frequency < 10%) for maximum gene set sizes kmax = 2, ..., 5 and numbers of gene sets t = 2, ..., 4. For all values of kmax, Iter-RME returned only gene sets of size 2. The “Iteration” column denotes the index of each gene set Pi returned in each iteration of RME. Only 4/12 gene sets contain an interacting pair of genes according to the union of the KEGG and iRefIndex protein-protein interaction networks.


Supporting Figures

[Figure S1 plot: log10 runtime (secs), from 0.0 to 3.5, against passenger mutation probability q (0.0001, 0.0005, 0.001, 0.005, 0.01, 0.015, 0.02), with one curve each for 100, 200, 400, 800, 1600, 3200, 6400, 12800, and 22000 genes.]

Figure S1: Multi-Dendrix runtimes on simulated datasets. Average runtime (secs) of Multi-Dendrix on simulated data with 1000 patients and varying passenger mutation probabilities q and numbers of genes. For each set of parameters, the average runtime across 10 simulations is reported. The average runtime of Multi-Dendrix is under 3500 seconds for each dataset.


[Figure S2 mutation matrix: columns are TCGA GBM patient samples; rows are the genes of the four modules: (1) CDKN2A(D), CDK4(A), RB1, MSL3; (2) PTEN, PTEN(D), PIK3CA, PIK3R1, IDH1, PRDM2(A); (3) TP53, MDM4(A), MDM2(A), NPAS3(D), NLRP3; (4) EGFR, PDGFRA(A), RB1(D). Legend: exclusive mutations (all modules), exclusive mutations (same module only), co-occurring mutations (same module only); each shown for Amp, Del, and SNV.]

Figure S2: Multi-Dendrix results on GBM dataset as one mutation matrix. The four modules are shown in alternating gray backgrounds. For each module, patients are sorted first by those with mutations in only the module (dark blue), then by patients that have just one mutation in the module (light blue), and finally by patients that have co-occurring mutations in that module (orange). Upticks indicate amplifications, downticks indicate deletions, and full ticks indicate SNVs. The four modules found by Multi-Dendrix on the GBM dataset cover 98.8% (258/261) of the patients in the GBM mutation data.
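The quantities summarized in this kind of mutation matrix (patients with exactly one mutation in a module, patients with co-occurring mutations, and overall coverage) can be computed with a short sketch like the one below. The patient identifiers and their mutations are hypothetical toy data, and the function names are assumptions; only the two module gene lists are taken from the figure.

```python
def classify_patients(mutations, module):
    """Split patients by how many mutation classes of `module` they carry."""
    exclusive, co_occurring = [], []
    for patient, genes in mutations.items():
        hits = len(genes & module)
        if hits == 1:
            exclusive.append(patient)      # exactly one mutation in the module
        elif hits > 1:
            co_occurring.append(patient)   # two or more mutations in the module
    return exclusive, co_occurring


def coverage(mutations, modules):
    """Return (covered, total): patients with a mutation in at least one module."""
    covered = sum(
        1 for genes in mutations.values() if any(genes & mod for mod in modules)
    )
    return covered, len(mutations)


# Hypothetical toy patients; mutation classes use the "(A)"/"(D)" suffix convention.
mutations = {
    "P1": {"EGFR"},
    "P2": {"EGFR", "RB1(D)"},
    "P3": {"TP53"},
    "P4": {"MDM2(A)", "TP53"},
    "P5": set(),
}
modules = [
    {"EGFR", "PDGFRA(A)", "RB1(D)"},
    {"TP53", "MDM4(A)", "MDM2(A)", "NPAS3(D)", "NLRP3"},
]

exclusive, co_occurring = classify_patients(mutations, modules[0])
covered, total = coverage(mutations, modules)
print(f"{covered}/{total} patients covered")
```

The coverage reported in the caption has the same form: covered patients over all patients in the mutation data.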

[Figure S3 mutation matrix: columns are TCGA BRCA patient samples; rows are the genes of the four modules: (1) PIK3CA, 12p13.33(A), PTEN(D), PIK3R1, AKT1, HIF3A; (2) TP53, GATA3, CDH1, CTCF, GPRIN2; (3) MAP2K4(D), MAP3K1, PPEF1, SMARCA4, WWP2; (4) CCND1(A), MAP2K4, RB1, GRID1. Legend: exclusive mutations (all modules), exclusive mutations (same module only), co-occurring mutations (same module only); each shown for Amp, Del, and SNV.]

Figure S3: Multi-Dendrix results on BRCA dataset as one mutation matrix. Representation is as in Figure S2. The four modules found by Multi-Dendrix cover 91.9% (466/507) of the patients in the BRCA mutation data.


[Figure S4 network diagram with weighted edges (weights from 0.2 to 1.0). Node labels, with the count shown in parentheses in the figure: CCND1(A) (3), SF3B1 (2), 13q12.3(A) (3), MYC(A) (3), 8q21.13(A) (6), MAP3K1 (8), MAP2K4(D) (8), GRIN2B (2), RUNX1 (4), PTEN(D) (2), MIR21(A) (2), PIK3CA (9), MLL3 (1), AKT1 (4), PIK3R1 (4), 12p13.33(A) (5), 8p11.23(A) (5), MAP2K4 (5), CDH1 (9), GATA3 (9), TP53 (9), CTCF (2), CEP68 (1), PPEF1 (1), OPRM1 (1).]

Figure S4: Multi-Dendrix results on the BRCA dataset with α = 1. Four genes (genes with degree zero) are output for only one choice of parameter values. The remaining genes form two connected components, indicating that the collections are variable across different values of the parameters. In comparison, Multi-Dendrix with α = 2.5 (Figure 3) finds more stable results on this dataset.

[Figure S5 network diagram with weighted edges (weights from 0.2 to 1.0). Node labels, with the count shown in parentheses in the figure: GRIN2B (6), MAP3K1 (9), 8q21.13(A) (9), RUNX1 (3), MAP2K4(D) (9), TP53 (9), FOXA1 (3), PIK3CA (9), ABCC12 (6), GATA3 (9), CCND1(A) (4), SF3B1 (2), MYC(A) (4), 13q12.3(A) (4), MAP2K4 (4), PTPRD(D) (5), RB1(D) (2), 8p11(A) (5), CDH1 (5), MLL3 (1).]

Figure S5: Iter-Dendrix results on the BRCA dataset. The modules identified by Iter-Dendrix group the most frequently mutated genes together (e.g. TP53 and PIK3CA), and thus there is a large number of samples with co-occurring mutations in each module. In comparison, Multi-Dendrix with α = 2.5 (Figure 3) finds more stable modules with less co-occurrence of mutations within modules on this dataset.


[Figure S6 network diagram with weighted edges (weights from 0.2 to 1.0). Node labels, with the count shown in parentheses in the figure: ERBB2 (6), RB1 (9), CDKN2B (9), DTX3 (9), CDC123 (6), CDKN2A (8), TP53 (9), DOCK1 (1), EGFR (4), NF1 (4), KDR (2), CDKN2C (2), RPS6KA2 (2), KIT (2), CHI3L1 (1), MGMT-DPYSL4 (3), MTAP (6), PTEN (5), GLI1 (1), FIP1L1 (1), PIM1 (2), AVIL-CTDSP2 (2), TAF1 (1), OS9 (1), CDK4 (9).]

Figure S6: Multi-Dendrix results on the GBM(2008) dataset.

[Figure S7 network diagram with weighted edges (weights from 0.2 to 1.0). Node labels, with the count shown in parentheses in the figure: MYO3B (2), NTRK1 (5), NF1 (6), STK11 (9), EPHB1 (7), EGFR (9), NTRK3 (5), KRAS (9), MAP3K3 (3), ACVR1B (3), ATM (9), TP53 (9), PAK4 (9), MSH6 (1), CDKN2A (3), LRP1B (5), GNAS (3), ABL1 (1), NRAS (1), PTEN (2), IGF1R (1), ERBB4 (1), BARD1 (1), PRKCG (1), MDM2 (1), BAP1 (1), PAK6 (1).]

Figure S7: Multi-Dendrix results on the Lung dataset.


[Figure S8 network diagram with weighted edges (weights from 0.2 to 0.9). Node labels, with the count shown in parentheses in the figure: PGAP2 (1), 8q21.13(A) (5), SAAL1 (4), RB1 (5), ANKRD30B (5), 12p13.33(A) (9), PTEN(D) (3), MYC(A) (3), LOC440518 (3), LRRC41 (2), MAP3K1(D) (6), 19q13.43(A) (6), PHLDB1 (4), PIK3CA (6), AEBP1 (4), ABCC12 (5), GATA3 (8), TP53 (9), C1ORF65 (3), ECE2 (1), LOC148709 (1), PTPRD(D) (4), 8p11.23(A) (4), QRICH2 (4), MAST1 (1), FAM171A1 (1), TOPBP1 (1).]

Figure S8: Multi-Dendrix results on the BRCA dataset restricted to Basal-like patients.


[Figure S9 network diagram with weighted edges (weights from 0.2 to 1.0). Node labels, with the count shown in parentheses in the figure: CCND1(A) (1), 8q21.13(A) (1), ENSG00000247765 (2), AKT1 (9), ADK-MYST4(A) (5), MIR21(A) (3), PIK3CA (9), MLL3 (4), MAP3K1 (9), MAP2K4(D) (9), ZNF217(A) (7), 8p11.23(A) (4), MAP2K4 (4), MCL1(A) (2), CDH1 (6), 19q13.43(A) (2), 8p11(A) (6), GATA3 (9), C9ORF102 (2), 13q12.3(A) (1), 20p12.1(D) (2), RUNX1 (2), PIK3R1 (1), PTPRD(D) (3), CTCF (2), PPEF1 (1), TP53 (1), PTEN (1).]

Figure S9: Multi-Dendrix results on the BRCA dataset restricted to Luminal A patients.

[Figure S10 network diagram with weighted edges (weights from 0.2 to 1.0). Node labels, with the count shown in parentheses in the figure: GPRASP2 (1), GRHL2 (1), CCND1(A) (9), MAP2K4 (8), EGFR(A) (3), RB1 (9), PRRX1 (6), CDH1 (6), CADPS (4), GATA3 (6), TP53 (6), PYHIN1 (2), CD163L1 (9), MAP2K4(D) (9), PIK3CA (9), RPGR (2), GUCY2C (5), ENSG00000247765 (3), TBX3 (3), ZNF217(A) (3), PTPRD(D) (2), CUL4B (2).]

Figure S10: Multi-Dendrix results on the BRCA dataset restricted to Luminal B patients.


[Figure S11 network diagram with weighted edges (weights from 0.2 to 1.0). Node labels, with the count shown in parentheses in the figure: 4q13.3(A) (3), MLL3 (3), MIR21(A) (6), ROR1 (6), PIK3CA (6), MAP2K4(D) (3), ZFPM2 (5), SF3B1 (9), APOB48R (5), ADK-MYST4(A) (4), ERBB2(A) (9), SDCCAG8 (3), HLA-DRB6 (5), TP53 (9), ABCA8 (4), STK31 (4), CUL4B (6), KTN1 (1), AKT1 (2), SEMA4F (1), GRIN2B (1), PPP1R3A (1), NLRP2 (1), SAGE1 (1), ARHGAP23 (1), RUNX1 (1), CCT6B (2), TPO (1), RTL1 (1), CFH (1), ZNF841 (1), FOXA1 (1), COL1A2 (1).]

Figure S11: Multi-Dendrix results on the BRCA dataset restricted to HER2-enriched patients.

[Figure S12 network diagram with weighted edges (weights from 0.2 to 1.0). Node labels, with the count shown in parentheses in the figure: MDN1 (2), RB1(D) (3), EGFR (3), PDGFRA(A) (3), MSL3 (3), CDK4(A) (9), RB1 (9), CDKN2A(D) (9), SUSD3 (3), PTEN(D) (9), PDPN,PRDM2(A) (2), IDH1 (8), HMCN1 (2), PTEN (9), PIK3CA (5), LMNB2 (3), PIK3R1 (1), KRTAP5-5 (2), MDM4(A) (6), AKAP6,NPAS3(D),NPAS3(D) (4), TP53 (6), MDM2(A) (6), 10q26.3 (1).]

Figure S12: Multi-Dendrix results on the unfiltered GBM mutation data.


[Figure S13 network diagram with weighted edges (weights from 0.2 to 1.0). Node labels, with the count shown in parentheses in the figure: MAP3K1 (2), PTEN (2), ERBB2(A) (2), SVIL (1), USH2A (1), GPRIN2 (3), CDH1 (9), GATA3 (9), TP53 (9), CTCF (6), SF3B1 (1), PTEN(D) (9), AKT1 (8), ASXL2 (2), GRIK2 (7), PIK3CA (9), ANKRD11(D) (6), RB1(D) (5), SYNE2 (4), PTPRD(D) (6), RUNX1 (2), MLL3 (1), PCNXL2 (1), PAK1(A) (1), MAP3K1(D) (1), MAP2K4(D) (1).]

Figure S13: Multi-Dendrix results on the unfiltered BRCA mutation data.
