

A large-scale analysis of tissue-specific pathology and gene expression of human disease genes and complexes

Kasper Lage (a,b,c,1), Niclas Tue Hansen (a,1), E. Olof Karlberg (a,d), Aron C. Eklund (a), Francisco S. Roque (a), Patricia K. Donahoe (b,c,2), Zoltan Szallasi (a,c,e), Thomas Skøt Jensen (a), and Søren Brunak (a,f,2)

(a) Center for Biological Sequence Analysis, Kemitorvet building 208, Technical University of Denmark, DK-2800 Lyngby, Denmark; (b) Pediatric Surgical Research Laboratories, MassGeneral Hospital for Children, Massachusetts General Hospital, 55 Fruit Street, Boston, MA 02114; (c) Harvard Medical School, 25 Shattuck Street, Boston, MA 02115; (d) Stockholm Bioinformatics Center, Albanova, Stockholm University, Roslagstullsbacken 35, SE-114 21 Stockholm, Sweden; (e) Children’s Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology, 300 Longwood Avenue, Boston, MA 02115; and (f) Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3, DK-2200 Copenhagen N, Denmark

Contributed by Patricia K. Donahoe, October 28, 2008 (sent for review July 21, 2008)

Heritable diseases are caused by germ-line mutations that, despite tissuewide presence, often lead to tissue-specific pathology. Here, we make a systematic analysis of the link between tissue-specific gene expression and pathological manifestations in many human diseases and cancers. Diseases were systematically mapped to tissues they affect from disease-relevant literature in PubMed to create a disease–tissue covariation matrix of high-confidence associations of >1,000 diseases to 73 tissues. By retrieving >2,000 known disease genes, and generating 1,500 disease-associated protein complexes, we analyzed the differential expression of a gene or complex involved in a particular disease in the tissues affected by the disease, compared with nonaffected tissues. When this analysis is scaled to all diseases in our dataset, there is a significant tendency for disease genes and complexes to be overexpressed in the normal tissues where defects cause pathology. In contrast, cancer genes and complexes were not overexpressed in the tissues from which the tumors emanate. We specifically identified a complex involved in XY sex reversal that is testis-specific and down-regulated in ovaries. We also identified complexes in Parkinson disease, cardiomyopathies, and muscular dystrophy syndromes that are similarly tissue specific. Our method represents a conceptual scaffold for organism-spanning analyses and reveals an extensive list of tissue-specific draft molecular pathways, both known and unexpected, that might be disrupted in disease.

proteomics | systems biology | computational biology

Pathology caused by defects in human genes is usually highly tissue-specific (1–4). In heritable diseases, this suggests that specific spatiotemporal functions of the implicated genes are disrupted due to germ-line mutations. Research on tissue specificity of human diseases has focused on the analysis of single-disease genes in affected tissues (5, 6), and although it has been shown that disease genes generally tend to be expressed in a limited number of tissues (2), it is still unclear in many cases how the tissue-specific expression patterns of disease genes correlate with their pathological manifestations.

Proteomics approaches have established that most gene products exert their function as members of one or more protein complexes (7–11), and that mutations in different proteins participating in the same complex, such as cellular machines, rigid structures, dynamic signaling networks, and posttranslational modification systems, generally lead to similar phenotypes (8, 12, 13). A next logical step is to model entire disease complexes and to analyze the link between tissue specificity of the complexes and the pathological manifestations with which they are associated when defective. However, such efforts are hampered by the lack of adequate coverage of experimental proteomic data in humans and of strategies for systematically analyzing hundreds of diseases, and their related genes and protein complexes, across multiple tissues of the human organism.

Here, we describe a strategy (Fig. 1) for systematically correlating pathological manifestations of diseases with expression patterns of implicated genes and protein complexes across many human tissues. For this analysis we created and validated a number of datasets, including approximately 1,500 disease-associated protein complexes, and to these added tissue and subcellular localization. Then, a method for systematically associating diseases to affected tissues was developed. Across all diseases in the Online Mendelian Inheritance in Man (OMIM) (14) database to which causative genes could be mapped, we analyze the correlation between tissue-specific expression and pathological manifestation both at the cellular level of single-disease genes and for entire disease-associated protein complexes. Finally, we systematically compared the tissue-specific pattern of expression and pathology in cancer-initiating genes and complexes, causing familial cancers, with that of non-cancer disease genes and complexes.

Results

Systematic Generation of an Atlas of Disease-Associated Protein Complexes with Tissue Resolution. By mining the GeneCards (15) resource for genes associated with diseases, we generated a list of 2,227 unique disease-related proteins. Similar to the method that we reported earlier (13), an in silico approach for generating disease-associated protein complexes based on an inferred human protein–protein interaction network was used [see supporting information (SI) Text and Fig. S1]. Following this strategy, we generated 1,524 raw complexes comprising 45,662 unique interactions between 5,202 unique proteins. The quality of the complexes was validated by measures identical to the ones reported in major experimental screens in Saccharomyces cerevisiae, Escherichia coli, and Homo sapiens (7–11, 16, 17), showing that the quality of our data matches the reproducibility, average probabilistic interaction scores, accuracy, and coverage reported in these studies and that the complexes are true biological entities (see SI Text, Table S1, and Fig. S2). Finally, the complexes were mapped to tissues by using the expression data from 73 nondiseased tissues from the Novartis Research Foundation Gene Expression Database (GNF) (18). The expression level of a complex in a tissue was calculated by averaging over the expression levels of all genes represented in the complex.
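The averaging step at the end of this section can be illustrated with a minimal sketch. The gene names, tissue names, and expression values below are made up, and skipping genes that have no expression data is an assumption of this sketch rather than something stated in the text.

```python
from statistics import mean


def complex_expression(complex_genes, expression):
    """Average, per tissue, the expression of the complex members that have data.

    `expression` maps gene -> {tissue: expression level}. Genes missing from
    `expression` are skipped (an assumption of this sketch).
    """
    tissues = sorted(
        {t for g in complex_genes if g in expression for t in expression[g]}
    )
    profile = {}
    for tissue in tissues:
        values = [
            expression[g][tissue]
            for g in complex_genes
            if g in expression and tissue in expression[g]
        ]
        profile[tissue] = mean(values)
    return profile


# Hypothetical toy data: two measured genes and one gene without data.
expression = {
    "GENE_A": {"heart": 10.0, "liver": 2.0},
    "GENE_B": {"heart": 6.0, "liver": 4.0},
}
profile = complex_expression(["GENE_A", "GENE_B", "GENE_C"], expression)
print(profile)
```

In this toy case the complex-level value for each tissue is simply the mean of the two measured members.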

Author contributions: K.L., N.T.H., Z.S., T.S.J., and S.B. designed research; K.L., N.T.H., E.O.K., A.C.E., and F.S.R. performed research; K.L., N.T.H., E.O.K., A.C.E., F.S.R., P.K.D., Z.S., T.S.J., and S.B. analyzed data; and K.L. and N.T.H. wrote the paper.

The authors declare no conflict of interest.

Freely available online through the PNAS open access option.

1 K.L. and N.T.H. contributed equally to this work.

2 To whom correspondence may be addressed. E-mail: [email protected] or brunak@cbs.dtu.dk.

This article contains supporting information online at www.pnas.org/cgi/content/full/0810772105/DCSupplemental.

© 2008 by The National Academy of Sciences of the USA

20870–20875 | PNAS | December 30, 2008 | vol. 105 | no. 52 www.pnas.org/cgi/doi/10.1073/pnas.0810772105


Mapping Complexes to Diseases. To map complexes to diseases we systematically identified the proteins that had been associated with each of the diseases mentioned in OMIM. This was done by using the protein-to-OMIM mapping displayed in the GeneCards database (http://www-bimas.cit.nih.gov/cards/). We then measured the overlap between proteins in complexes and proteins associated with the diseases and calculated the significance of this overlap. Because a number of complexes are known to be involved in different diseases we allowed for a complex to be associated with more than one disease. In total the 1,524 raw complexes were mapped to 1,054 OMIM diseases. Hereafter, we refer to these as disease complexes.
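The text above does not state which statistic was used to score the overlap between a complex and the proteins of a disease. As one plausible illustration only, the sketch below uses a one-sided hypergeometric test; the function name, the toy protein identifiers, and the background size are all hypothetical.

```python
from math import comb


def overlap_p_value(population_size, disease_proteins, complex_proteins):
    """P(overlap >= observed) when the complex is drawn at random.

    Assumes (for this sketch) that every protein in both sets belongs to a
    background of `population_size` proteins.
    """
    n_disease = len(disease_proteins)
    n_complex = len(complex_proteins)
    observed = len(set(disease_proteins) & set(complex_proteins))
    total = comb(population_size, n_complex)
    tail = sum(
        comb(n_disease, i) * comb(population_size - n_disease, n_complex - i)
        for i in range(observed, min(n_disease, n_complex) + 1)
    )
    return tail / total


# Hypothetical toy case: 20 background proteins, 3 disease proteins,
# a 4-protein complex that contains 2 of the disease proteins.
disease = {"P1", "P2", "P3"}
cplx = {"P1", "P2", "P9", "P10"}
p = overlap_p_value(20, disease, cplx)
print(p)
```

A complex would then be assigned to every disease for which this P value falls below a chosen threshold, which is how one complex can end up associated with more than one disease.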

Fig. 1. Overview of the study. (A) The different analyses and how they relate to each other. (B) 59 inherited cancers and approximately 1,000 other Mendelian disorders are mapped to 2,227 causative genes and 1,524 complexes by using a combination of automated parsing of OMIM and PubMed. Genes and complexes are stratified into 3 major categories: noncancer disease, cancer gain of function, and cancer loss of function. This stratification is done by a combination of manual curation and semiautomated steps. (C) A unique set of 1,524 protein complexes associated with disease is generated by querying the proteins of disease genes for direct interaction partners in a human protein interaction network followed by several quality control steps. (D) Transcriptional regulation of both genes and sets of genes that work together in cellular complexes is analyzed across tissues of the human organism. (E) Diseases are mapped to relevant tissues by using association degree of particular diseases and tissues across PubMed. Steps are taken to reduce errors in word recognition and handle synonyms accurately. These steps are followed by determination of an optimal cutoff and rigorous quality control. This produced a matrix where diseases are mapped to tissues relevant to the pathology with a precision of �0.8. Cancers are mapped to tissues that are the primary origin of tumor formation with a precision �0.95.

MEDICAL SCIENCES

Disease–Tissue Association Matrix. To our knowledge no systematic mapping of diseases to affected tissues exists. We determined the covariance of a disease with a tissue by identifying the number of publications comentioning the disease and tissue (and synonyms thereof), relative to the number of publications mentioning the disease or tissue alone (19). We transformed the covariance into an association score between a tissue and a disease by calculating the fraction of covariance that a given tissue–disease pair constituted, of the total covariance for a given disease. Calculating an association score for the 73 tissues used in the GNF tissue atlas (18) versus 1,054 OMIM diseases yielded a disease–tissue association matrix (Fig. 2). By manually validating the associations we determined a cutoff where tissues associated with the pathology of a given disease could be determined with a precision of �80% (see SI Text, Table S2, and Fig. S3), meaning that above this threshold tissues relevant to the pathology of a given disease can be accurately identified among the GNF atlas tissues in �80% of the cases. High confidence associations scoring above this threshold are blue to dark blue in Fig. 2. Tissues associated with the pathology of a given disease are hereafter defined as disease–tissue associations scoring above this cutoff.
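The association score described in the Disease–Tissue Association Matrix section can be sketched as follows. The exact covariance formula comes from ref. 19 and is not reproduced in the text, so the Jaccard-style ratio used here is an assumption, as are the publication counts. The 8% cutoff is the value given in the Fig. 2 legend.

```python
def cooccurrence(n_both, n_disease, n_tissue):
    """Co-mention count relative to publications mentioning either term.

    Assumed stand-in for the covariance measure of ref. 19.
    """
    union = n_disease + n_tissue - n_both
    return n_both / union if union else 0.0


def association_scores(n_disease, tissue_counts, co_counts):
    """Fraction of a disease's total covariance carried by each tissue."""
    cov = {
        tissue: cooccurrence(co_counts.get(tissue, 0), n_disease, n_tissue)
        for tissue, n_tissue in tissue_counts.items()
    }
    total = sum(cov.values())
    return {t: (c / total if total else 0.0) for t, c in cov.items()}


# Hypothetical publication counts for one disease and three tissues.
tissue_counts = {"heart": 1000, "liver": 1000, "skin": 1000}
co_counts = {"heart": 60, "liver": 20}
scores = association_scores(100, tissue_counts, co_counts)

CUTOFF = 0.08  # 8% cutoff, as in the Fig. 2 legend
high_confidence = sorted(t for t, s in scores.items() if s > CUTOFF)
print(scores, high_confidence)
```

Because the scores for one disease sum to 1, a fixed percentage cutoff selects the tissues that dominate that disease's literature, regardless of how often the disease itself is mentioned.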

Mapping Complexes to Cancers. A large number of genes have been associated with cancers, due to aberrant expression or somatic mutations in tumors. However, few of these genes have actually been proven to play a role in the initiation of the tumor. Hence, an automated mapping of cancer genes to complexes would include many genes that are mutated in tumors but do not cause the cancer. Because we are interested in studying the tissue distribution of disease-initiating genes and complexes, we manually created an exhaustive list of heritable cancer genes that initiate tumors through germ-line mutations. These genes were manually mapped to the OMIM diseases describing the cancers (Table S3). For this subset of genes, there is compelling evidence that defects are the primary cause of the cancer. In total we extracted a subset of 51 genes in which mutations lead to heritable cancers and mapped them to 59 cancers. Because most cancer mutations are either loss or gain of function, which could influence the mechanisms of disease progression and have bearing on the mechanisms of tissue specificity, we further stratified the cancer genes into loss or gain of function as defined in Vogelstein et al. (4). Examples of loss-of-function genes are tumor suppressor or DNA repair genes that become defective when mutated, and examples of gain of function are kinases that become constitutively activated by mutations (Table S4). Cancer-associated complexes were identified as complexes enriched for this subset of genes. Hereafter, we refer to these as cancer complexes.

Generating a Disease–Tissue Association Matrix for Cancers. Cancer to tissue association mapping is not straightforward. In this study we were interested in exclusively studying the tissues in which tumors are initiated through germ-line mutations of particular genes. Because cancers generally affect many tissues through downstream effects such as metastases, associations to noninitiating tissues had to be filtered out. Furthermore, many cancer syndromes, arising from germ-line mutations in cancer genes, also include nonmalignant pathology, for which disease–tissue association had to be disregarded in this analysis. For this reason, we manually analyzed the complete subset of tissues associated with heritable cancer syndromes resulting in a precision approximating 100% for the cancer–tissue associations (SI Text and Table S5).

Correlation Between Pathology and Tissue-Specific Expression. First, we analyzed the expression of disease genes in the tissue with the highest disease association in the disease–tissue matrix (rank 1). This analysis was repeated for the 2nd to 25th highest associated tissues (rank 2 to 25) and the average z score at each rank level was plotted as a curve (Fig. 3A). For example, myosin heavy chain 6 (MYH6) is involved in hypertrophic cardiomyopathy and the tissues from the GNF atlas ranked first and second in relation to hypertrophic cardiomyopathy are heart and cardiac myocytes. We determined the z score of MYH6 in heart (tissue rank 1) and the average z score of MYH6 in the 2 highest ranked tissues, heart and cardiac myocytes (tissue rank 2). This procedure is repeated for ranks 3 to 25. This gives a set of rank-dependent z scores for MYH6. This procedure is repeated for every disease gene in every disease, yielding rank-dependent z scores for every gene–disease combination, which are plotted in Fig. 3A. This figure shows the clear tendency of overexpression for disease genes in tissues with the highest rank (blue curve). The curves for cancer genes show 2 different trends. Although gain-of-function genes are overexpressed in tissues with the highest rank (red curve), loss-of-function genes are underexpressed (green curve).
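The rank-dependent z scores in the MYH6 example can be sketched as a running average over the ranked tissues. This sketch assumes that a gene's z score in a tissue is its expression standardized across all tissues, which the text does not spell out; the expression values are invented for illustration.

```python
from statistics import mean, pstdev


def z_scores(profile):
    """Standardize one gene's expression across tissues (assumed definition)."""
    mu = mean(profile.values())
    sd = pstdev(profile.values())
    return {t: ((x - mu) / sd if sd else 0.0) for t, x in profile.items()}


def rank_dependent_z(profile, ranked_tissues, max_rank=25):
    """Average z score over the top-k associated tissues, for k = 1..max_rank."""
    z = z_scores(profile)
    curve, running = [], 0.0
    for k, tissue in enumerate(ranked_tissues[:max_rank], start=1):
        running += z[tissue]
        curve.append(running / k)
    return curve


# Made-up expression profile for a heart-specific gene such as MYH6.
profile = {"heart": 9.0, "cardiac myocytes": 7.0, "liver": 2.0, "skin": 2.0}
# Tissues ordered by their association with the disease (rank 1 first).
ranked = ["heart", "cardiac myocytes", "liver"]
curve = rank_dependent_z(profile, ranked)
print(curve)
```

Averaging such curves over all gene–disease pairs gives one curve per gene category, which is the quantity the text describes for Fig. 3A.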

To see whether the observed expression trends were significant, we averaged the z scores in the tissues associated with the disease and compared the scores with their expression levels in nonaffected tissues (Fig. 3B). For non-cancer disease genes we observed a significant tendency of overexpression (P � 1.0E-6), which is also the case for gain-of-function cancer genes (P � 3.9E-2), but with less significance. Loss-of-function cancer genes show the converse trend of underexpression (P � 1.0E-2).
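The text does not name the test behind the reported P values. Purely as a stand-in, the sketch below compares z scores in affected and nonaffected tissues with a one-sided permutation test; the z scores are invented and the function name is hypothetical.

```python
import random


def permutation_p_value(affected, nonaffected, n_perm=2000, seed=0):
    """One-sided P value for mean(affected) - mean(nonaffected).

    Tissue labels are shuffled `n_perm` times; the P value is the share of
    shuffles whose difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    k = len(affected)
    pooled = list(affected) + list(nonaffected)

    def diff(values):
        return sum(values[:k]) / k - sum(values[k:]) / (len(values) - k)

    observed = diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if diff(pooled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Invented z scores of one gene set in affected vs. nonaffected tissues.
affected = [1.2, 0.8, 1.5, 0.9]
nonaffected = [-0.3, 0.1, -0.5, 0.0, -0.2, 0.2, -0.1, -0.4]
p_over = permutation_p_value(affected, nonaffected)
print(p_over)
```

Testing for underexpression, as reported for loss-of-function cancer genes, would use the same procedure with the sign of the difference reversed.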

We carried out the same analysis for the protein complexes, which showed that the expression trend observed for disease genes is conserved at the level of disease protein complexes (see Fig. 3 C and D). These disease complexes display a significant tendency to be overexpressed in tissues where they are involved in pathology (P � 10E-6, blue curve). While protein complexes significantly enriched for gain-of-function cancer genes follow the tendency of overexpression (P � 0.44, red curve), complexes enriched for loss-of-function cancer genes are underexpressed (P � 3.4E-3, green curve).

Fig. 2. Disease–tissue association matrix. The color range goes from light gray, which corresponds to no association of disease and tissue, to dark blue at 12% association. Only high confidence associations scoring above 8% (blue to dark blue) are used in the further analysis. The percent association is the proportion of a disease’s association to a particular tissue in the Novartis Research Foundation Gene Expression Database (GNF) atlas, out of the cumulative association to all tissues in the atlas. (A) The first 100 diseases mapped to the 73 tissues in the GNF atlas. A more detailed view of the matrix can be seen by using the zoom tool. (B) A subset of the disease–tissue associations.

Because the z scores were lower for the cancer genes and complexes compared with the more robust values of the non-cancer disease genes and complexes, we tested whether this result was influenced by the dataset and normalization method. We replicated the analysis by using a different robust multiarray (RMA)-based normalization scheme (20). Expression data normalized with this algorithm still showed a significant overexpression of disease genes and complexes, but both the over- and underexpression trends for the cancer genes and complexes decreased in significance. To test whether a few diseases or tissues were driving the observed trend, we analyzed the expression trend broken down into single tissues (Fig. S4a) and bootstrapped the dataset at both the disease and tissue level. This analysis shows that most tissues and diseases contribute to the observed results and that the results are robust to bootstrapping of the dataset (Fig. S4b).
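The bootstrapping check can be illustrated with a minimal sketch that resamples diseases with replacement and asks how often the average z score stays positive. The per-disease z scores are invented, and the summary statistic is an assumption of this sketch; the actual procedure is described in the SI.

```python
import random


def bootstrap_means(values, n_boot=1000, seed=0):
    """Means of `n_boot` resamples of `values`, drawn with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_boot):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return means


# Invented average z scores, one per disease.
disease_z = [0.5, 0.3, 0.1, 0.4, -0.1, 0.6, 0.2, 0.35]
boot = bootstrap_means(disease_z)
fraction_positive = sum(m > 0 for m in boot) / len(boot)
print(fraction_positive)
```

If a handful of diseases were driving the trend, a noticeable share of resamples would lose the positive average; a fraction close to 1 suggests the trend is spread across diseases. The same resampling can be applied to tissues.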

Examples of Disease Complexes with Tissue and Phenotype Correlation. Examples of the reported correlations between tissue expression and pathology or phenotype are provided in Fig. 4. Also, the most significant gene ontology (GO) subcellular and functional categories for the complex in question are indicated, followed by the significance with which the complex can be assigned to this GO category (21). Tissue names are as defined in the GNF atlas. The full sets of proteins in each complex can be seen in Fig. S5.

XY sex reversal can be caused by mutations in the transcription factors SRY (sex determining region Y) (22); SOX9 (SRY sex determining region Y-box 9) (23); NR5A1 (nuclear receptor subfamily 5A1), more commonly known as SF1 (24, 25); and NR0B1 (nuclear receptor subfamily 0B1), more commonly known as DAX1 (26). Additionally, SOX9 is associated with campomelic dysplasia, a bone disorder that leads to a number of associated skeletal and cartilaginous deformities (27). SF1 is needed for gonad and adrenal differentiation (25, 28) and for proper steroidogenesis as well as for Mullerian Inhibiting Substance (MIS) ligand and MIS receptor expression (28, 29). DAX1 leads to XY sex reversal both when overexpressed, by inhibiting SF1 (26), and when inactivated, as it is required for testis differentiation by regulating expression of SOX9 (30). Although the activity of SF1, DAX1, and SOX9 is required for testis differentiation, development, and maintenance, none of these genes are essential for ovarian development and maintenance (30–33). Here, we identify a transcriptional regulation complex (GO:0006355: P � 1.9E-8) containing DAX1, SF1, and SOX9, all of which are known to be associated with sex reversal (P � 6.9E-6). Furthermore, the complex contains SOX8, which is closely related to SOX9 and implicated in regulating the expression of testis-specific genes (34). Whereas the complex is overexpressed in testis cells, it is underexpressed in ovaries (Fig. 4), which coincides with the known biology of the most well characterized of its components. Our method thus has predictive value because it can (i) detect interactions between molecules that, by themselves, are known to be important in sex differentiation and determination by producing sex reversal, (ii) validate these findings by demonstrating dimorphic tissue-specific expression that correlates with the pathology resulting from inactivation of several members of the complex, and (iii) reveal the importance of new interactors worthy of further study.

Four other complexes, where tissue-specific overexpression correlates with pathological manifestations, are depicted in Fig. 4 (see SI Text and Fig. S5 for more details on these 4 complexes and for examples of cancer-related complexes). These include (i) a complex involved in Charcot–Marie–Tooth disease type 4F and overexpressed in spinal cord, dorsal root ganglion, and skeletal muscles; (ii) a sarcoglycan complex involved in limb–girdle muscular dystrophy overexpressed in skeletal muscle, cardiac myocytes, and heart; (iii) a myofibril complex involved in familial cardiomyopathy overexpressed in several tissues associated with the disease such as heart and cardiac myocytes; and (iv) a complex involved in catechol metabolism and Parkinson disease, overexpressed in a number of relevant brain tissues including the caudate nucleus, subthalamic nucleus, and globus pallidus. Although the overexpression of the sarcoglycan and myofibril complex in muscle tissues is well known, the ovarian–testes dimorphic expression pattern of the sex-reversal complex, and the overexpression of a Parkinson complex in several relevant brain tissues of the basal ganglia are suggestive of the underlying tissue-specific biology of these disorders. Across all examples the tissue-specific expression patterns correlate with the pathological changes observed when one or several members of the complex are defective.

Fig. 3. Expression levels of disease genes and complexes in pathologically associated tissues. (A) The expression level of genes associated with diseases and cancers in the tissues most associated with the particular disease caused by the genes. Tissues are ranked with the most associated tissue at the intersection with the y axis and in declining order from left to right. This plot shows the trend of overexpression for disease genes and gain-of-function cancer genes in tissues with the highest rank. Loss-of-function cancer genes are generally underexpressed in the tissues with the highest rank. (B) The average disease gene expression in associated tissues is shown. Disease genes are overexpressed with an average z score of 0.28 (P � 10E-6). The cancer-associated genes show 2 different trends: gain-of-function genes follow the trend of all disease genes, with an average z score of 0.30 (P � 3.9E-2), but loss-of-function genes have a tendency to be underexpressed in the tissues associated with tumor formation, with an average z score of −0.21 (P � 1.0E-2). (C and D) The same analysis is shown at the level of protein complexes, where the trend is conserved.

Discussion

The complex dataset reported here is approximately 3 times larger than our reported set of complexes (13) and contains approximately 7 times more interactions than the only reported experimental screen for human complexes (35). To our knowledge, this dataset comprises a unique set of systematically generated complexes with tissue, phenotype, and subcellular resolution in any mammalian organism. The entire atlas is made available online at http://www.cbs.dtu.dk/suppl/dgf/.

A theoretical limitation of our approach is that we use gene expression data to map complexes to tissues because of the lack of good coverage of quantitative proteomics expression data. Early studies of the relationships between mRNA expression and protein abundance levels have consistently reported modest correlations (36–38). Recent work, which uses a probabilistic framework to model the relationship between the experimentally recorded protein and mRNA patterns, has confirmed that in 75% of all genes tissue mRNA expression patterns linearly correlate with protein abundance, and this overall good correlation is shown for the dataset we use in this work (39). However, to test how a lack of correlation for 25% of the genes affects our results, we randomized 25% of the data points and found that the results achieved for disease genes and complexes, and for loss-of-function cancer genes and complexes, were robust (P � 1.0E-3, see SI Text). Furthermore, the tissue resolution of our complexes is supported by the observation that they are significantly enriched in proteins cooccurring in tissue samples that are analyzed by using manually curated immunohistochemistry data (SI Text and Fig. S2).
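The randomization check mentioned above can be sketched as follows. How the 25% of data points were chosen and randomized is described only in the SI, so shuffling the selected values among themselves is an assumption of this sketch, and the input values are placeholders.

```python
import random


def randomize_fraction(values, fraction=0.25, seed=0):
    """Return a copy of `values` with a random `fraction` of entries shuffled.

    The selected entries are permuted among themselves, so the overall
    distribution of values is preserved while their positions are scrambled.
    """
    rng = random.Random(seed)
    result = list(values)
    n_pick = round(len(result) * fraction)
    positions = rng.sample(range(len(result)), n_pick)
    picked = [result[i] for i in positions]
    rng.shuffle(picked)
    for i, value in zip(positions, picked):
        result[i] = value
    return result


# Placeholder expression values for 20 gene-tissue data points.
original = [float(i) for i in range(20)]
perturbed = randomize_fraction(original)
n_changed = sum(a != b for a, b in zip(original, perturbed))
print(n_changed)
```

Rerunning the affected-versus-nonaffected comparison on the perturbed data, over many random seeds, would indicate whether the trends survive when a quarter of the measurements carry no tissue information.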

Our results support the notion that known disease genes generally are tissue specific (1, 2), by being selectively overexpressed in the tissues in which specific gene defects cause pathology. Alternatively, high levels of gene expression may be needed for the functional activity of the tissue. Moreover, we show that this trend is conserved also at the level of the protein complexes in which the disease genes carry out their biological function.

Most known genes that initiate cancer are involved in ubiquitous processes such as DNA repair, cell cycle regulation, and apoptosis (4, 40–42), and it remains a key puzzle in oncology to determine how germ-line mutations in general genes initiate tissue-specific tumors (40). To investigate this contradiction, we also analyzed the expression patterns of cancer genes and complexes involved in heritable cancer syndromes. The gain-of-function cancer genes and complexes follow the trend of non-cancer disease genes and are generally overexpressed in tissues where they initiate tumors; conversely, complexes enriched for loss-of-function genes are underexpressed in the tissues where mutations cause neoplastic transformation. Our results for cancer genes and complexes were not robust when different algorithms were used to normalize the expression data. There could be a number of reasons for the lack of a tissue-specificity signal for the analyzed cancer genes and complexes. The current concepts of cancer indicate that some tumors are initiated by a small subset of stem cells (43) whose specific expression levels would be impossible to detect in tissue samples with the resolution used here. Another hypothesis is that tumor initiation is caused by a combination of mutations in a key gene, exposure to mutagenic substances or ionizing radiation, and high proliferation rates of specific cell populations in a tissue (40), a combination we do not analyze here. However, our results highlight the fundamental difference between the tissue specificity of cancers and other diseases, and show that this difference is consistent on both gene and complex level.

Functional genomics and sequencing have been extremely useful tools for identifying the complete sets of genes in humans and model organisms, and deducing how disruption of different genes in a common molecular pathway can lead to similar phenotypic pathologies. These results indicate how the function of genes is organized in space and time. The next step is to analyze entire systems that are significantly associated with human diseases. This has proven difficult in humans because of experimental limitations and ethical issues, suggesting that other strategies must be considered. Using data integration and systems biology, we take a step toward this goal by integrating and refining existing data, and by creating new data sets. Hereby we identify a comprehensive list of functional modules that are associated with pathological processes in humans. We analyze their spatial tissue-specific and subcellular patterns and correlate this information with the diseases that are the result of defects in the modules. As such, our dataset and the scaffold of the analysis presented could be useful in disease systems biology of humans, and provide draft mechanistic pathways that can serve as potential molecular drug targets.

Fig. 4. Representative examples of disease complexes are displayed. Diseases are associated with tissues by using our disease–tissue matrix, and expression data are from the GNF dataset. The expression levels of complexes are shown as z scores. If a disease is associated with more than 3 tissues, only the 3 most associated tissues are shown for clarity. In a given complex, proteins relevant to the disease in question are yellow. The figure shows the general tendency of overexpression of the complexes in the tissues in which they are involved in pathology compared with their expression level in other tissues. All members of the complexes can be seen in Fig. S5.

20874 | www.pnas.org/cgi/doi/10.1073/pnas.0810772105 Lage et al.


Materials and Methods

Mapping Genes and Complexes to Tissues. We used the GNF tissue atlas (18) that includes reproduced RNA expression experiments from 79 human tissues. Six tissues were removed because they were derived from cancer tissues. We chose the GNF dataset because it displays high reproducibility (44), and the transcript levels show generally a linear relationship with protein abundance (39). We log-transformed hybridization levels and normalized within each tissue (to ensure equal weight), followed by a normalization across all tissues, thereby ensuring that expression levels represented the relative presence of a transcript in one tissue compared with the other 72 healthy tissues in the dataset. For complexes, the normalized expression levels of all genes in a complex were averaged for each tissue. To test the effect of different normalization methods on our results, we prepared the same dataset with Eklund and Szallasi’s normalization method (20) and compared the results.
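The two-step normalization and the per-complex averaging can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used for the paper: the text does not specify the logarithm base or the normalization statistic, so the sketch assumes log2 and z-score standardization (population standard deviation) in both steps. The function names and the nested-dictionary layout are hypothetical.

```python
import math
from statistics import mean, pstdev

def zscores(values):
    """Standardize numbers to mean 0 and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def normalize(expr):
    """expr[gene][tissue] holds a positive raw hybridization level.
    Returns z[gene][tissue] after log transform, within-tissue
    standardization, and across-tissue standardization (assumed steps)."""
    genes = list(expr)
    tissues = list(next(iter(expr.values())))
    logged = {g: {t: math.log2(expr[g][t]) for t in tissues} for g in genes}
    # Step 1: standardize within each tissue so all tissues weigh equally.
    within = {g: {} for g in genes}
    for t in tissues:
        column = zscores([logged[g][t] for g in genes])
        for g, z in zip(genes, column):
            within[g][t] = z
    # Step 2: standardize each gene across tissues, giving its relative
    # presence in one tissue compared with the others.
    out = {}
    for g in genes:
        row = zscores([within[g][t] for t in tissues])
        out[g] = dict(zip(tissues, row))
    return out

def complex_expression(z, members):
    """Average the normalized expression of a complex's genes per tissue."""
    tissues = list(z[members[0]])
    return {t: mean(z[g][t] for g in members) for t in tissues}
```

A gene or tissue with constant values would give a zero standard deviation; real data would need a guard for that case, which is omitted here for brevity.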

A Curated Set of Genes in Which Mutations Lead to Tumor Formation. We curated a set of genes in which mutations had been shown to lead to heritable tumor formation and mapped them to OMIM diseases (see Table S3). By following the definitions introduced by Vogelstein et al. (4) we also noted whether the genes were oncogenes or nononcogenes (such as tumor suppressors or DNA repair proteins) (see Table S4).

Mapping of Complexes to OMIM Diseases. We calculated the enrichment of proteins involved in the same OMIM disease by using the annotations in GeneCards, which has previously been shown to be an accurate way of mapping genes to diseases (13). We calculated the significance of an enrichment by using a hypergeometric test.
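As an illustration of the hypergeometric enrichment test, the sketch below computes the upper-tail probability of seeing at least a given number of disease-annotated proteins in a complex. The choice of background population and the one-sided upper tail are assumptions, since the text does not state them; the function and parameter names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, complex_size, hits):
    """P(X >= hits) when `complex_size` proteins are drawn without
    replacement from `population` proteins, of which `annotated`
    are linked to the disease in question."""
    total = comb(population, complex_size)
    upper = min(annotated, complex_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, complex_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

For example, with a background of 10 proteins of which 3 are annotated, a complex of 4 proteins containing at least 2 annotated members has probability 70/210 = 1/3 under the null.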

Under- and Overexpression Significance. We averaged the expression z score over all disease genes in the most disease-associated tissue as determined from the disease–tissue matrix. For each rank from 1 through 25, we calculated the average z score yielding a curve. In Fig. 3A, this curve is plotted as the average z scores of all gene–disease pairs in tissues with a particular rank. This procedure was repeated for gain-of-function and loss-of-function cancer genes. Again this approach was repeated on a protein complex level. All reported significances are 2-tailed using the Student’s t test.
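The rank-wise averaging can be sketched as below. The text does not say which form of the t test was used, so the second function assumes a one-sample test of the mean z score against zero and returns only the t statistic and degrees of freedom; a 2-tailed P value would then come from a t-distribution CDF in a statistics library. All names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def rank_curve(pairs, max_rank=25):
    """pairs: one list per gene-disease pair, giving the gene's z score
    in the disease's tissues ordered from most to least associated.
    Returns the average z score at each rank, rank 1 first."""
    curve = []
    for r in range(max_rank):
        at_rank = [p[r] for p in pairs if len(p) > r]
        if not at_rank:
            break                      # no pair has a tissue at this rank
        curve.append(mean(at_rank))
    return curve

def one_sample_t(values, mu0=0.0):
    """t statistic and degrees of freedom for H0: mean(values) == mu0."""
    n = len(values)
    t = (mean(values) - mu0) / (stdev(values) / sqrt(n))
    return t, n - 1
```

Applying `one_sample_t` to the z scores at rank 1 tests whether genes are over- or underexpressed in the most associated tissue.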

Disease-Tissue Association Matrix. To identify the tissues most affected by diseases described in the OMIM database (14), we used comentioning of a given disease with a given tissue across PubMed (19). The tissue names from the Novartis Research Foundation Gene Expression Database (GNF) (18) were manually curated and translated to corresponding medical subject heading (MeSH) terms (to reduce errors in word recognition and handle synonyms properly). Similarly, the disease names were determined by using disease titles provided in OMIM. Also, these titles were manually curated and translated to the relevant MeSH terms. We used Ochiai’s coefficient (OC) as a measure of similarity derived from the cooccurrences (45–47), and calculated an association score (see below), as the percentage of the total normalized cooccurrence of a given disease that could be attributed to a given tissue. Validation was carried out as described in SI Text.

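\[
\mathrm{OC}(k_D, k_T) = \sqrt{\frac{n_{DT}^{2}}{n_D \cdot n_T}}
\]

\[
\mathrm{AS}(k_D, k_T) = 100 \cdot \frac{\mathrm{OC}(k_D, k_T)}{\sum_i \mathrm{OC}(k_D, k_{T_i})}
\]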
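A minimal sketch of the two formulas follows. It assumes the standard reading of Ochiai's coefficient, in which n_DT is the number of PubMed records mentioning both the disease term and the tissue term, and n_D and n_T are the numbers of records mentioning each term alone. The function names and the input dictionaries are hypothetical.

```python
from math import sqrt

def ochiai(n_dt, n_d, n_t):
    """Ochiai's coefficient from co-occurrence and marginal counts."""
    if n_d == 0 or n_t == 0:
        return 0.0                     # no records, no association
    return n_dt / sqrt(n_d * n_t)      # equals sqrt(n_dt**2 / (n_d * n_t))

def association_scores(n_d, tissue_counts, cooccurrence):
    """Association score (percent) of one disease with every tissue.
    tissue_counts[t] is n_T; cooccurrence[t] is n_DT for that tissue."""
    oc = {
        t: ochiai(cooccurrence.get(t, 0), n_d, n_t)
        for t, n_t in tissue_counts.items()
    }
    total = sum(oc.values())
    if total == 0:
        return {t: 0.0 for t in oc}
    return {t: 100.0 * v / total for t, v in oc.items()}
```

With a disease mentioned in 100 records, a tissue mentioned in 400 records with 40 comentions (OC = 0.2), and a second tissue mentioned in 100 records with 10 comentions (OC = 0.1), the association scores are two-thirds and one-third of 100, respectively.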

ACKNOWLEDGMENTS. We thank Matthias Mann, Jiri Bartek, Gert-Jan B. van Ommen, Barbara Pober, and Jonathan Rosand for valuable input on the manuscript and project, Anders Lendager and Lene Hep from MAPT for help with the figures, Zenia Marian Størling for assistance with the initial analyses, Kasper Fugger and Christopher Workman for helpful discussions, and Olga Rigina for curating the PPI databases. This work was supported by Villum Kann Rasmussen Foundation, the Simon Spies Foundation, National Institute of Child Health and Development Grant CD RO1 HD0551-50, and the National Institutes of Health

1. Winter EE, Goodstadt L, Ponting CP (2004) Elevated rates of protein secretion, evolution, and disease among tissue-specific genes. Genome Res 14:54–61.

2. Goh KI, et al. (2007) The human disease network. Proc Natl Acad Sci USA 104:8685–8690.

3. Chao EC, Lipkin SM (2006) Molecular models for the tissue specificity of DNA mismatch repair-deficient carcinogenesis. Nucleic Acids Res 34:840–852.

4. Vogelstein B, Lane D, Levine AJ (2000) Surfing the p53 network. Nature 408:307–310.

5. Beyer K, et al. (2008) Identification and characterization of a new alpha-synuclein isoform and its role in Lewy body diseases. Neurogenetics 9:5–23.

6. Kim KY, Kee MK, Chong SA, Nam MJ (2007) Galanin is up-regulated in colon adenocarcinoma. Cancer Epidemiol Biomarkers Prev 16:2373–2378.

7. Gavin AC, et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415:141–147.

8. Gavin AC, et al. (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature 440:631–636.

9. Krogan NJ, et al. (2006) Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440:637–643.

10. Ho Y, et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183.

11. Butland G, et al. (2005) Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 433:531–537.

12. van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA (2006) A text-mining analysis of the human phenome. Eur J Hum Genet 14:535–542.

13. Lage K, et al. (2007) A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol 25:309–316.

14. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33:D514–D517.

15. Safran M, et al. (2002) GeneCards 2002: Towards a complete, object-oriented, human gene compendium. Bioinformatics 18:1542–1543.

16. Rual JF, et al. (2005) Towards a proteome-scale map of the human protein-protein interaction network. Nature 437:1173–1178.

17. Stelzl U, et al. (2005) A human protein-protein interaction network: a resource for annotating the proteome. Cell 122:957–968.

18. Su AI, et al. (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA 101:6062–6067.

19. Korbel JO, et al. (2005) Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 3:e134.

20. Eklund AC, Szallasi Z (2008) Correction of technical bias in clinical microarray data improves concordance with known biological information. Genome Biol 9:R26.

21. Camon E, et al. (2004) The Gene Ontology Annotation (GOA) Database: Sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 32(Database issue):D262–D266.

22. Polanco JC, Koopman P (2007) Sry and the hesitant beginnings of male development. Dev Biol 302:13–24.

23. Patel M, et al. (2001) Primate DAX1, SRY, and SOX9: evolutionary stratification of sex-determination pathway. Am J Hum Genet 68:275–280.

24. Parker KL (1998) The roles of steroidogenic factor 1 in endocrine development and function. Mol Cell Endocrinol 140:59–63.

25. Park SY, Tong M, Jameson JL (2007) Distinct roles for steroidogenic factor 1 and desert hedgehog pathways in fetal and adult Leydig cell development. Endocrinology 148:3704–3710.

26. Swain A, Narvaez V, Burgoyne P, Camerino G, Lovell-Badge R (1998) Dax1 antagonizes Sry action in mammalian sex determination. Nature 391:761–767.

27. Pop R, Zaragoza MV, Gaudette M, Dohrmann U, Scherer G (2005) A homozygous nonsense mutation in SOX9 in the dominant disorder campomelic dysplasia: A case of mitotic gene conversion. Hum Genet 117:43–53.

28. MacLaughlin DT, Donahoe PK (2004) Sex determination and differentiation. N Engl J Med 350:367–378.

29. Shen WH, Moore CC, Ikeda Y, Parker KL, Ingraham HA (1994) Nuclear receptor steroidogenic factor 1 regulates the mullerian inhibiting substance gene: a link to the sex determination cascade. Cell 77:651–661.

30. Meeks JJ, Weiss J, Jameson JL (2003) Dax1 is required for testis determination. Nat Genet 34:32–33.

31. Notarnicola C, Malki S, Berta P, Poulat F, Boizet-Bonhoure B (2006) Transient expression of SOX9 protein during follicular development in the adult mouse ovary. Gene Expr Patterns 6:695–702.

32. Bouma GJ, Washburn LL, Albrecht KH, Eicher EM (2007) Correct dosage of Fog2 and Gata4 transcription factors is critical for fetal testis development in mice. Proc Natl Acad Sci USA 104:14994–14999.

33. Biason-Lauber A, Schoenle EJ (2000) Apparently normal ovarian differentiation in a prepubertal girl with transcriptionally inactive steroidogenic factor 1 (NR5A1/SF-1) and adrenocortical insufficiency. Am J Hum Genet 67:1563–1568.

34. Schepers G, Wilson M, Wilhelm D, Koopman P (2003) SOX8 is expressed during testis differentiation in mice and synergizes with SF1 to activate the Amh promoter in vitro. J Biol Chem 278:28101–28108.

35. Ewing RM, et al. (2007) Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol Syst Biol 3:89.

36. Mootha VK, et al. (2003) Integrated analysis of protein composition, tissue diversity, and gene regulation in mouse mitochondria. Cell 115:629–640.

37. Griffin TJ, et al. (2002) Complementary profiling of gene expression at the transcriptome and proteome levels in Saccharomyces cerevisiae. Mol Cell Proteomics 1:323–333.

38. Le Roch KG, et al. (2004) Global analysis of transcript and protein levels across the Plasmodium falciparum life cycle. Genome Res 14:2308–2318.

39. Kislinger T, et al. (2006) Global survey of organ and organelle protein expression in mouse: Combined proteomic and transcriptomic profiling. Cell 125:173–186.

40. David SS, O’Shea VL, Kundu S (2007) Base-excision repair of oxidative DNA damage. Nature 447:941–950.

41. Petrocca F, et al. (2006) Alterations of the tumor suppressor gene ARLTS1 in ovarian cancer. Cancer Res 66:10287–10291.

42. Falck J, Mailand N, Syljuasen RG, Bartek J, Lukas J (2001) The ATM-Chk2-Cdc25A checkpoint pathway guards against radioresistant DNA synthesis. Nature 410:842–847.

43. Singh SK, et al. (2004) Identification of human brain tumour initiating cells. Nature 432:396–401.

44. Huminiecki L, Lloyd AT, Wolfe KH (2003) Congruence of tissue expression profiles from Gene Expression Atlas, SAGEmap and TissueInfo databases. BMC Genomics 4:31.

45. Ochiai A (1957) Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions. Bull Jpn Soc Sci Fish 22:526–530.

46. Jackson DA, Somers KM, Harvey HH (1989) Similarity measures: Measures of co-occurrence and association or simply measures of co-occurrence? Am Nat 133:436–453.

47. Udoh E, Rhoades J (2006) in Third International Conference on Information Technology: New Generations (IEEE Computer Society, Washington, DC), pp 490–494.
