Upload
timothy
View
212
Download
0
Embed Size (px)
Citation preview
Exploring the Genetics Underlying Autoimmune Diseases with NetworkAnalysis and Link Prediction *
Gregorio Alanis-Lobato, Carlo V. Cannistraci and Timothy Ravasi
Abstract— Ever since the first Genome Wide AssociationStudy (GWAS) was carried out we have seen an importantnumber of discoveries of biological and clinical relevance.However, there are some scientists that consider that theseresearch outcomes and their utility are far from what wasexpected from this experimental design. We instead believethat the thousands of genetic variants associated with complexdisorders by means of GWASs are an extremely valuable sourceof information that needs to be mined in a different way. Basedon this philosophy, we followed a holistic perspective to analyzeGWAS data and explored the structural properties of thenetwork representation of one of these datasets with the aim toadvance our understanding of the genetic intricacies underlyingautoimmune human diseases. The simplicity, computationalefficiency and precision of the tools proposed in this paperrepresent a new means to address GWAS data and contributeto the better exploitation of these rich sources of information.
I. INTRODUCTION
Despite the thousands of genetic variants associated with
human complex traits since the first Genome Wide As-
sociation Study (GWAS) [1], the independent contribution
of variants to genetic risk has failed to fully explain the
heritability of human diseases [2], [3]. It has been long
recognized that bio-process regulation is possible only thanks
to the relationships between genes and the proteins they
produce [3], which advocates the analysis of GWAS data
from a network perspective.
Day by day, more comprehensive and complex biological
datasets become available and prompt us to make use of
smart strategies to quickly and accurately analyse these
valuable sources of information. One such a strategy, the
emerging field of Network Medicine, not only allows for
the systematic study of the molecular intricacies underlying
complex traits but it also serves as a proxy to identify the
relationships between apparently different phenotypes [4].
While metabolic, protein interaction, virus-host and regu-
latory networks are important to understand the interdepen-
dencies between molecular components in the cell [5], [4],
genetic interaction networks (or epistatic networks) represent
abstract relationships between genes associated with genetic
buffering. It is thought that these networks can be the key
*This work was supported by King Abdullah University of Science andTechnology (KAUST)
G. Alanis Lobato, C. V. Cannistraci and T. Ravasi are with theDepartment of Bioscience and the Department of Computer Science,KAUST, 4700 KAUST, 23955-6900, Thuwal, Saudi Arabia (phone: +966-564-75-14-14, email: [email protected]; email: [email protected]; email: [email protected])
C. V. Cannistraci is with the Biomedical Cybernetics Group, Biotech-nology Center, Technische Universitat Dresden, Tatzberg 47-51, 01307,Dresden, Germany
to uncover the missing heritability that single gene effects
cannot explain [3]. Unfortunately, although well-established
techniques exist to detect gene interactions in model or-
ganisms [6], their determination in human is still work-in-
progress.
In 2007, Goh and colleagues [7] proposed a work around
this shortcoming and built a human disease gene network by
means of a method that we here exploit to construct an Au-
toimmune Disease Network (ADN, see Sec. II). We decided
to focus on autoimmune disorders due to their relevance in
terms of people affected, direct health care costs and research
funding when compared to, for example, cancer (23.5 million
vs 9 million Americans, 100 billion vs 57 billion USD and
591 million vs 6.1 million USD, respectively) [8]. The basic
topological properties of the ADN, the Gene Ontology (GO)
and expression similarities of its constituting genes and the
modules that they form are studied in Secs. II and III. In
Sec. IV we introduce the weighted version of link predictors
that we proposed in recent publications and benchmark them
against classical prediction techniques to later choose the
best approach and analyse the biological implications of its
top-scored candidate interactions. Our concluding remarks
appear in Sec. V.
II. NETWORK REPRESENTATION OF GWAS DATA
A. Autoimmune Disease Network
Based on the list of autoimmune diseases curated by the
American Autoimmune Related Disease Association [8], we
extracted genes associated with autoimmune disorders from
the GWAS Catalog [9] and constructed a bipartite network
of diseases and their associated genes (Fig. 1, left panel).
We then projected this network to its one-mode form, where
a pair of genes is linked with a weighted edge indicating
the number of diseases they are both associated with (Fig. 1,
right panel). We refer to this undirected, weighted network
as the ADN.
B. Structural Properties of the ADN
Most real networks, representing dynamic and hetero-
geneous systems, present the small world property (high
clustering coefficient and low average shortest path length
when compared to Erdos-Renyi (ER) random graphs) [10],
have heavy-tailed node degree distributions (which indicate
self-organization and presence of hubs, i.e. highly connected
entities) [11] and their structure follows the local community
paradigm or LCP (common neighbours of connected pairs
of nodes are also connected) [12]. We compared the small-
world features of 1000 ER graphs with the same number of
2014 Middle East Conference on Biomedical Engineering (MECBME)February 17-20, 2014, Hilton Hotel, Doha, Qatar
978-1-4799-4799-7/14/$31.00 ©2014 IEEE 167
A
B
C
D
a
b
c
d
e
f
a
b
c
d
e
f
1 2
2
Diseases Genes
Bipartite network Weighted one-mode projection ofthe bipartite network
11
11
1
Fig. 1. One-mode projection of the bipartite network of genes and diseases.
0 100 200 300 400 500 6000
0.5
1
1.5
2
2.5
3
3.5
4
4.5x 10
−3
P(k
)
Node degree (k)0 100 200 300 400 500
0
20
40
60
80
100
120
140
160
180
200
Number of common neighbors
√ Li
nks
betw
een
com
mon
nei
ghbo
rs
LCP Decomposition PlotNode Degree Distribution
Fig. 2. The node degree distribution and LCP decomposition plot of theADN.
nodes and edges (840 and 49,485 respectively) as the ADN
and found that the latter is far from being random: it has a
clustering coefficient of 0.902 (ER graph = 0.14±0.007) and
an average shortest-path of 2.009 (ER graph = 1.86±0.006),
i.e. it does present the small world property.
Although the node degree distribution of the ADN does
not decay as a power law, it presents a right heavy tail
that indicates the presence of network hubs (Fig. 2, left
panel). Furthermore, the correlation between the number
of common neighbors of connected genes in the network
and the square root of the number of connections between
them (LCP-correlation) is very high (0.97) as depicted by
the LCP Decomposition Plot (Fig. 2, right panel). The
very high values of clustering (quite common in one-mode
projections of bipartite networks [13]) and LCP-correlation
indicate that the network is organized in tightly connected
communities of genes and we can thus take advantage of
this to discover disease modules and perform neighborhood-
based link predicition (more details in Secs. III and IV).
III. COMMUNITY DETECTION
After the application of a community detection algorithm
based on Modularity Optimization (maximization of com-
munity quality by finding groups of nodes that are densely
connected internally and that, at the same time, are connected
between them by only a few edges) [14], we found that genes
associated with the same or with related diseases clustered
together (Fig. 3).
A. Biological Properties of the Detected Communities
The majority of the detected communities are part of the
main component of the ADN and contain genes associated
with related diseases. Nevertheless, a couple of modules
formed by genes associated with a particular disease are
also present (Fig 3), as it is the case for the Restless
Legs Syndrome or Dilated Cardiomyopathy communities.
GO Semantic Similarity [16] and Gene CoExpression [17]
computation for each community and the complete network,
shows large values of the former and low of the latter (see
bars next to each community in Fig. 3), which indicates that
the analyzed pairs of genetic variants can be expressed in, for
example, different tissues and yet have important phenotypic
impact. As mentioned above, community detection grouped
genes associated with related diseases (e.g. the green commu-
nity of genes associated with inflammatory disorders) but this
was not always the case, like in the pink and grass colored
communities (Fig. 3). The first one clusters genes associated
with Rheumatoid Arthritis (RA) and Celiac Disease (CD),
two disorders with two very different phenotypes: chronic
joint inflammation and intestine inflammation respectively
but that can nevertheless show some common features: HLA
molecule association, T-cell infiltration in target organs and
high degree of comorbidity [18]. The diseases represented
in the second one, Type I Diabetes (TID) (autoimmune
destruction of insulin-producing β-cells) and Vitiligo (skin
depigmentation due to melanocyte malfunction), are also
phenotypically different but it is known that patients with
TID tend to present skin pigmentation problems, especially
in areas of insulin injection [19].
IV. LINK PREDICTION
We assessed the performance of multiple topological link
predictors (indices that assign likelihood scores to non-
adjacent pairs of nodes based on the network topology)
when applied to the communities that are not completely
connected (namely the blue, pink, green, grass and magenta
modules). This evaluation is done by removing edges from
these networks and assessing the ability of the predictors
to rediscover them up to the point where the network loses
connectivity.
A. Weighted Link Predictors
As mentioned in Sec. II, the ADN is a weighted network
and because of this, the weighted version of the classical
prediction techniques [20] was considered in this paper
(see Table I, Eqs. 1-3). Besides, we introduce the weighted
version of two techniques that we proposed in recent publi-
cations for the prediction of epistatic interactions [21] (Table
I, Eq. 4) and links in general (Table I, Eq. 5) [12].
B. Prune-rediscover Performance Evaluation
Fig. 4 depicts the very good performance of all indices
and highlights the sensitivity of wCAR to the loss of the
local community structure of the network: as links are
removed from the topology, local communities disappear and
wCAR loses precision. The power of neighborhood-based
168
Restless legssyndrome
Dilated cardiomyopathy
Grave’s disease
Endometriosis
Primary biliary cirrhosis
Eosinophilicesophagitis (pediatric)
Type I diabetes, Vitiligo
AIDS,Hypothyroidism,
Psoriasis
Multiple sclerosis
Ankylosing spondylitis,Crohn’s disease,
Primary sclerosing cholangitis,Ulcerative colitis
Celiac disease,Rheumatoid arthritis
Diabetic retinopathy,IgA nephropathy,
Kawasaki disease,Systemic lupus erythematosus
Avg. Gene Ontology Similarity (Wang Index)
Avg. Gene CoExpression (Pearson correlation)
0.42
0.10
0.60
0.06
0.68
0.07
0.63
0.06
0.62
0.07
0.65
0.11
0.62
0.07
0.50
0.03
0.75
0.06
0.82
0.12
0.74
0.05
0.59
0.07
0.64
0.07
Disease Network
Fig. 3. The ADN colored according to the communities detected by Modularity Optimization. This network visualization was done with Gephi 0.8.2 [15]
TABLE I
WEIGHTED LINK PREDICTORS
Predictor Formulation Eq. Number
Common Neighbors (wCN) wCN(x, y) =∑
z∈Γ(x)∩Γ(y)w(x,z)+w(y,z)
21
Preferential Attachment (wPA) wPA(x, y) = k′x × k′y 2
Adamic and Adar (wAA) wAA(x, y) = wCN(x, y)× 1
log(∑
z′∈Γ(z) w(z′,z))
3
Adjusted Czekanowski-Dice Dissimilarity (wACDD) wACDD(x, y) =2wCN(x,y)
k′x+k′
y+λx+λy4
Cannistraci-Alanis-Ravasi Index (wCAR) wCAR(x, y) = wCN(x, y)× wLCL 5
The penalization parameters in wACDD correspond to λx = max(0, navg − k′x), navg is the weighted average node degree of the network, k′x is theweighted degree of node x and wLCL is the sum of the weights of the links connecting the common neighbors of x and y (local community links)
approaches is expected, given the high values of cluster-
ing and LCP-correlation of the network. These structural
characteristics explain why wPA is unable to mimic the
performance of the rest: simple weighted degree products
are not enough for prediction in a network dominated by
gene pairs that share lots of neighbors.
Although wCN, wAA and wACDD present very simi-
lar performance patterns, the latter outperforms the other
techniques in most of the disease modules. We analysed
the biological relevance of the top 3 candidate interactions
advocated by this approach for each gene communities (see
next Sec.).
C. Biological Interpretation of Best Candidate Interactions
The biological analysis that led to the conclusions made
in this section, was performed with DAVID Bioinformatics
Resources 6.7 [22].
The top candidate links in the blue community correspond
to interactions of genes AFF1, AGXT2L1 and AG1P7 with
INSR. The products of these genes are involved in proper
blood filtering and toxin secretion. Thus, their simultaneous
mutations may explain the blood filtering problems in IgA
Nephropathy or the affection of blood vessels in Kawasaki
Disease or Lupus. The top candidates in the pink com-
munity are links between HLA-DQA2, HLA-DQB1 (T-cell
multiplication) and LPP (cell adhesion and motility), which
could explain chronic inflammation, and the link between
C1QTNF6 (regulation of immune cells) and IL2 (regulation
of Th17 cells), which could explain the high levels of Th17
cells in the affected tissues of RA and CD patients. The
top candidates edges in the green community are TMEM17-
COG6, COG6-B3GNT2 and B3GNT2-PARK7. Genetic vari-
ants in TMEM17, COG6 and B3GNT2 would directly impact
protein transport and their interaction with PARK7 could
produce aberrations in the proper protection against cell
stress. For the grass community, the top candidate links
are the interactions of ADPRH, ATP8B1 and CASP7 with
C14orf181. The latter is an uncharacterized gene that has
169
05001000150020002500
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Remaining links
Avg
. Pre
cisi
on
wCN wAA wPA wACDD wCAR Random
010002000300040005000
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Remaining links
020004000600080001000012000
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Remaining links
05001000150020002500
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Remaining links
0200400600
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Remaining links
Diabetic retinopathy, IgA nephropathy,
Kawasaki disease,Systemic lupus erythematosus
Celiac disease,Rheumatoid arthritis
Ankylosing spondylitis,Crohn’s disease,
Primary sclerosing cholangitis,Ulcerative colitis
Type I diabetes, Vitiligo
AIDS,Hypothyroidism,
Psoriasis
Fig. 4. Performance evaluation of classical and proposed weighted link predictors.
been recently removed from several databases due to the
presence of a more interesting gene in the same chromosome.
However, not only was it part of our top 3 candidates but
also of the top 10. Given the association of ATP8B1 with
bile disorders and the high expression of CASP7 in liver,
it would be interesting to put more attention to C14orf181.
Finally, the set of top candidate interactions for the magenta
community are CAST-CCR5, CCR5-ANXA6 and ANXA6-
CCRL2, all involved in Calcium (Ca) regulation and intra-
cellular mobilization. This means that mutations in this set
of genes may impact signaling by Ca ions and explain Ca-
associated problems in AIDS or Psoriasis. Ca-regulation by
the thyroid gland could also be affected.
V. CONCLUSIONS
We analyzed GWAS data from a systems perspective,
which deems autoimmune disease genes not as isolated
entities but as important parts of a whole. The value of
the disease modules detected by Modularity Optimization
and the precision of the link predictors introduced, indicate
that they are powerful and trustworthy tools. We encourage
researchers to integrate Systems Biology and Network Sci-
ence into their projects because the knowledge generated by
these scientific devices could help us untangle the intricate
relationships between genotype and phenotype.
ACKNOWLEDGMENT
The authors thank all members of the Integrative Systems
Biology Lab at KAUST for their support, especially Maria
Contreras for proofreading the manuscript.
REFERENCES
[1] The Wellcome Trust Case Control Consortium, “Genome-wide asso-ciation study of 14,000 cases of seven common diseases and 3,000shared controls.” Nature, vol. 447, no. 7145, pp. 661–78, 2007.
[2] J. S. Bloom, I. M. Ehrenreich, W. T. Loo, T.-L. V. o. Lite, andL. Kruglyak, “Finding the sources of missing heritability in a yeastcross.” Nature, vol. 494, no. 7436, pp. 234–237, 2013.
[3] B. A. McKinney and N. M. Pajewski, “Six Degrees of Epistasis:Statistical Network Models for GWAS.” Frontiers in Genetics, vol. 2,no. January, pp. 1–6, 2011.
[4] A.-L. Barabasi, N. Gulbahce, and J. Loscalzo, “Network medicine: anetwork-based approach to human disease.” Nature Reviews. Genetics,vol. 12, no. 1, pp. 56–68, 2011.
[5] M. Vidal, M. E. Cusick, and A.-L. Barabasi, “Interactome networksand human disease.” Cell, vol. 144, no. 6, pp. 986–98, 2011.
[6] B. Lehner, “Genotype to phenotype: lessons from model organismsfor human genetics.” Nature Reviews. Genetics, vol. 14, no. 3, pp.168–178, 2013.
[7] K.-I. Goh, et al., “The human disease network.” Proceedings of theNational Academy of Sciences of the United States of America, vol.104, no. 21, pp. 8685–90, 2007.
[8] American Autoimmune Related Diseases Association, Inc. List ofAutoimmune and Autoimmune-Related Diseases. [Online]. Available:http://www.aarda.org
[9] L. Hindorff, et al. (2012) A Catalog of Pub-lished Genome-Wide Association Studies. [Online]. Available:http://www.genome.gov/gwastudies
[10] D. J. Watts and S. H. Strogatz, “Collective dynamics of ’small-world’networks.” Nature, vol. 393, no. 6684, pp. 440–2, 1998.
[11] A.-L. Barabasi and R. Albert, “Emergence of Scaling in RandomNetworks,” Science, vol. 286, no. October, pp. 509–512, 1999.
[12] C. V. Cannistraci, G. Alanis-Lobato, and T. Ravasi, “From link-prediction in brain connectomes and protein interactomes to the local-community-paradigm in complex networks,” Scientific Reports, vol. 3,pp. 1–13, 2013.
[13] T. Opsahl, “Triadic closure in two-mode networks: Redefining theglobal and local clustering coefficients,” Social Networks, vol. 35,no. 2, pp. 159–167, 2013.
[14] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fastunfolding of communities in large networks,” Journal of StatisticalMechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008,2008.
[15] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An open sourcesoftware for exploring and manipulating networks,” 2009.
[16] J. Wang, Z. Du, R. Payattakool, P. Yu, and C.-F. Chen, “A new methodto measure the semantic similarity of GO terms.” Bioinformatics,vol. 23, no. 10, pp. 1274–81, 2007.
[17] T. Obayashi, et al., “COXPRESdb: a database of coexpressed genenetworks in mammals.” Nucleic Acids Research, vol. 36, no. Databaseissue, pp. D77–82, 2008.
[18] M. J. H. Coenen, et al., “Common and different genetic background forrheumatoid arthritis and coeliac disease.” Human Molecular Genetics,vol. 18, no. 21, pp. 4195–203, 2009.
[19] M. R. Burge and J. D. Carey, “Vitiligo associated with subcutaneousinsulin lispro infusion in Type 1 Diabetes,” Diabetes Care, vol. 27,no. 1, pp. 275–276, 2004.
[20] T. Murata and S. Moriyasu, “Link Prediction of Social NetworksBased on Weighted Proximity Measures,” IEEE/WIC/ACM Interna-tional Conference on Web Intelligence (WI’07), pp. 85–88, 2007.
[21] G. Alanis-Lobato, C. V. Cannistraci, and T. Ravasi, “Exploitation ofgenetic interaction network topology for the prediction of epistaticbehavior,” Genomics, vol. 102, no. 4, pp. 202–208, 2013.
[22] D. W. Huang, B. T. Sherman, and R. Lempicki, “Systematic andintegrative analysis of large gene lists using DAVID bioinformaticsresources.” Nature Protocols, vol. 4, no. 1, pp. 44–57, 2009.
170