[IEEE 2014 Middle East Conference on Biomedical Engineering (MECBME) - Doha, Qatar (2014.02.17-2014.02.20)] 2nd Middle East Conference on Biomedical Engineering - Exploring the genetics

Exploring the Genetics Underlying Autoimmune Diseases with NetworkAnalysis and Link Prediction *

Gregorio Alanis-Lobato, Carlo V. Cannistraci and Timothy Ravasi

Abstract— Ever since the first Genome Wide AssociationStudy (GWAS) was carried out we have seen an importantnumber of discoveries of biological and clinical relevance.However, there are some scientists that consider that theseresearch outcomes and their utility are far from what wasexpected from this experimental design. We instead believethat the thousands of genetic variants associated with complexdisorders by means of GWASs are an extremely valuable sourceof information that needs to be mined in a different way. Basedon this philosophy, we followed a holistic perspective to analyzeGWAS data and explored the structural properties of thenetwork representation of one of these datasets with the aim toadvance our understanding of the genetic intricacies underlyingautoimmune human diseases. The simplicity, computationalefficiency and precision of the tools proposed in this paperrepresent a new means to address GWAS data and contributeto the better exploitation of these rich sources of information.

I. INTRODUCTION

Despite the thousands of genetic variants associated with

human complex traits since the first Genome Wide As-

sociation Study (GWAS) [1], the independent contribution

of variants to genetic risk has failed to fully explain the

heritability of human diseases [2], [3]. It has been long

recognized that bio-process regulation is possible only thanks

to the relationships between genes and the proteins they

produce [3], which advocates the analysis of GWAS data

from a network perspective.

Day by day, more comprehensive and complex biological

datasets become available and prompt us to make use of

smart strategies to quickly and accurately analyse these

valuable sources of information. One such a strategy, the

emerging field of Network Medicine, not only allows for

the systematic study of the molecular intricacies underlying

complex traits but it also serves as a proxy to identify the

relationships between apparently different phenotypes [4].

While metabolic, protein interaction, virus-host and regu-

latory networks are important to understand the interdepen-

dencies between molecular components in the cell [5], [4],

genetic interaction networks (or epistatic networks) represent

abstract relationships between genes associated with genetic

buffering. It is thought that these networks can be the key

*This work was supported by King Abdullah University of Science andTechnology (KAUST)

G. Alanis Lobato, C. V. Cannistraci and T. Ravasi are with theDepartment of Bioscience and the Department of Computer Science,KAUST, 4700 KAUST, 23955-6900, Thuwal, Saudi Arabia (phone: +966-564-75-14-14, email: [email protected]; email: [email protected]; email: [email protected])

C. V. Cannistraci is with the Biomedical Cybernetics Group, Biotech-nology Center, Technische Universitat Dresden, Tatzberg 47-51, 01307,Dresden, Germany

to uncover the missing heritability that single gene effects

cannot explain [3]. Unfortunately, although well-established

techniques exist to detect gene interactions in model or-

ganisms [6], their determination in human is still work-in-

progress.

In 2007, Goh and colleagues [7] proposed a work around

this shortcoming and built a human disease gene network by

means of a method that we here exploit to construct an Au-

toimmune Disease Network (ADN, see Sec. II). We decided

to focus on autoimmune disorders due to their relevance in

terms of people affected, direct health care costs and research

funding when compared to, for example, cancer (23.5 million

vs 9 million Americans, 100 billion vs 57 billion USD and

591 million vs 6.1 million USD, respectively) [8]. The basic

topological properties of the ADN, the Gene Ontology (GO)

and expression similarities of its constituting genes and the

modules that they form are studied in Secs. II and III. In

Sec. IV we introduce the weighted version of link predictors

that we proposed in recent publications and benchmark them

against classical prediction techniques to later choose the

best approach and analyse the biological implications of its

top-scored candidate interactions. Our concluding remarks

appear in Sec. V.

II. NETWORK REPRESENTATION OF GWAS DATA

A. Autoimmune Disease Network

Based on the list of autoimmune diseases curated by the

American Autoimmune Related Disease Association [8], we

extracted genes associated with autoimmune disorders from

the GWAS Catalog [9] and constructed a bipartite network

of diseases and their associated genes (Fig. 1, left panel).

We then projected this network to its one-mode form, where

a pair of genes is linked with a weighted edge indicating

the number of diseases they are both associated with (Fig. 1,

right panel). We refer to this undirected, weighted network

as the ADN.

B. Structural Properties of the ADN

Most real networks, representing dynamic and hetero-

geneous systems, present the small world property (high

clustering coefficient and low average shortest path length

when compared to Erdos-Renyi (ER) random graphs) [10],

have heavy-tailed node degree distributions (which indicate

self-organization and presence of hubs, i.e. highly connected

entities) [11] and their structure follows the local community

paradigm or LCP (common neighbours of connected pairs

of nodes are also connected) [12]. We compared the small-

world features of 1000 ER graphs with the same number of

2014 Middle East Conference on Biomedical Engineering (MECBME)February 17-20, 2014, Hilton Hotel, Doha, Qatar

978-1-4799-4799-7/14/$31.00 ©2014 IEEE 167

A

B

C

D

a

b

c

d

e

f

a

b

c

d

e

f

1 2

2

Diseases Genes

Bipartite network Weighted one-mode projection ofthe bipartite network

11

11

1

Fig. 1. One-mode projection of the bipartite network of genes and diseases.

0 100 200 300 400 500 6000

0.5

1

1.5

2

2.5

3

3.5

4

4.5x 10

−3

P(k

)

Node degree (k)0 100 200 300 400 500

0

20

40

60

80

100

120

140

160

180

200

Number of common neighbors

√ Li

nks

betw

een

com

mon

nei

ghbo

rs

LCP Decomposition PlotNode Degree Distribution

Fig. 2. The node degree distribution and LCP decomposition plot of theADN.

nodes and edges (840 and 49,485 respectively) as the ADN

and found that the latter is far from being random: it has a

clustering coefficient of 0.902 (ER graph = 0.14±0.007) and

an average shortest-path of 2.009 (ER graph = 1.86±0.006),

i.e. it does present the small world property.

Although the node degree distribution of the ADN does

not decay as a power law, it presents a right heavy tail

that indicates the presence of network hubs (Fig. 2, left

panel). Furthermore, the correlation between the number

of common neighbors of connected genes in the network

and the square root of the number of connections between

them (LCP-correlation) is very high (0.97) as depicted by

the LCP Decomposition Plot (Fig. 2, right panel). The

very high values of clustering (quite common in one-mode

projections of bipartite networks [13]) and LCP-correlation

indicate that the network is organized in tightly connected

communities of genes and we can thus take advantage of

this to discover disease modules and perform neighborhood-

based link predicition (more details in Secs. III and IV).

III. COMMUNITY DETECTION

After the application of a community detection algorithm

based on Modularity Optimization (maximization of com-

munity quality by finding groups of nodes that are densely

connected internally and that, at the same time, are connected

between them by only a few edges) [14], we found that genes

associated with the same or with related diseases clustered

together (Fig. 3).

A. Biological Properties of the Detected Communities

The majority of the detected communities are part of the

main component of the ADN and contain genes associated

with related diseases. Nevertheless, a couple of modules

formed by genes associated with a particular disease are

also present (Fig 3), as it is the case for the Restless

Legs Syndrome or Dilated Cardiomyopathy communities.

GO Semantic Similarity [16] and Gene CoExpression [17]

computation for each community and the complete network,

shows large values of the former and low of the latter (see

bars next to each community in Fig. 3), which indicates that

the analyzed pairs of genetic variants can be expressed in, for

example, different tissues and yet have important phenotypic

impact. As mentioned above, community detection grouped

genes associated with related diseases (e.g. the green commu-

nity of genes associated with inflammatory disorders) but this

was not always the case, like in the pink and grass colored

communities (Fig. 3). The first one clusters genes associated

with Rheumatoid Arthritis (RA) and Celiac Disease (CD),

two disorders with two very different phenotypes: chronic

joint inflammation and intestine inflammation respectively

but that can nevertheless show some common features: HLA

molecule association, T-cell infiltration in target organs and

high degree of comorbidity [18]. The diseases represented

in the second one, Type I Diabetes (TID) (autoimmune

destruction of insulin-producing β-cells) and Vitiligo (skin

depigmentation due to melanocyte malfunction), are also

phenotypically different but it is known that patients with

TID tend to present skin pigmentation problems, especially

in areas of insulin injection [19].

IV. LINK PREDICTION

We assessed the performance of multiple topological link

predictors (indices that assign likelihood scores to non-

adjacent pairs of nodes based on the network topology)

when applied to the communities that are not completely

connected (namely the blue, pink, green, grass and magenta

modules). This evaluation is done by removing edges from

these networks and assessing the ability of the predictors

to rediscover them up to the point where the network loses

connectivity.

A. Weighted Link Predictors

As mentioned in Sec. II, the ADN is a weighted network

and because of this, the weighted version of the classical

prediction techniques [20] was considered in this paper

(see Table I, Eqs. 1-3). Besides, we introduce the weighted

version of two techniques that we proposed in recent publi-

cations for the prediction of epistatic interactions [21] (Table

I, Eq. 4) and links in general (Table I, Eq. 5) [12].

B. Prune-rediscover Performance Evaluation

Fig. 4 depicts the very good performance of all indices

and highlights the sensitivity of wCAR to the loss of the

local community structure of the network: as links are

removed from the topology, local communities disappear and

wCAR loses precision. The power of neighborhood-based

168

Restless legssyndrome

Dilated cardiomyopathy

Grave’s disease

Endometriosis

Primary biliary cirrhosis

Eosinophilicesophagitis (pediatric)

Type I diabetes, Vitiligo

AIDS,Hypothyroidism,

Psoriasis

Multiple sclerosis

Ankylosing spondylitis,Crohn’s disease,

Primary sclerosing cholangitis,Ulcerative colitis

Celiac disease,Rheumatoid arthritis

Diabetic retinopathy,IgA nephropathy,

Kawasaki disease,Systemic lupus erythematosus

Avg. Gene Ontology Similarity (Wang Index)

Avg. Gene CoExpression (Pearson correlation)

0.42

0.10

0.60

0.06

0.68

0.07

0.63

0.06

0.62

0.07

0.65

0.11

0.62

0.07

0.50

0.03

0.75

0.06

0.82

0.12

0.74

0.05

0.59

0.07

0.64

0.07

Disease Network

Fig. 3. The ADN colored according to the communities detected by Modularity Optimization. This network visualization was done with Gephi 0.8.2 [15]

TABLE I

WEIGHTED LINK PREDICTORS

Predictor Formulation Eq. Number

Common Neighbors (wCN) wCN(x, y) =∑

z∈Γ(x)∩Γ(y)w(x,z)+w(y,z)

21

Preferential Attachment (wPA) wPA(x, y) = k′x × k′y 2

Adamic and Adar (wAA) wAA(x, y) = wCN(x, y)× 1

log(∑

z′∈Γ(z) w(z′,z))

3

Adjusted Czekanowski-Dice Dissimilarity (wACDD) wACDD(x, y) =2wCN(x,y)

k′x+k′

y+λx+λy4

Cannistraci-Alanis-Ravasi Index (wCAR) wCAR(x, y) = wCN(x, y)× wLCL 5

The penalization parameters in wACDD correspond to λx = max(0, navg − k′x), navg is the weighted average node degree of the network, k′x is theweighted degree of node x and wLCL is the sum of the weights of the links connecting the common neighbors of x and y (local community links)

approaches is expected, given the high values of cluster-

ing and LCP-correlation of the network. These structural

characteristics explain why wPA is unable to mimic the

performance of the rest: simple weighted degree products

are not enough for prediction in a network dominated by

gene pairs that share lots of neighbors.

Although wCN, wAA and wACDD present very simi-

lar performance patterns, the latter outperforms the other

techniques in most of the disease modules. We analysed

the biological relevance of the top 3 candidate interactions

advocated by this approach for each gene communities (see

next Sec.).

C. Biological Interpretation of Best Candidate Interactions

The biological analysis that led to the conclusions made

in this section, was performed with DAVID Bioinformatics

Resources 6.7 [22].

The top candidate links in the blue community correspond

to interactions of genes AFF1, AGXT2L1 and AG1P7 with

INSR. The products of these genes are involved in proper

blood filtering and toxin secretion. Thus, their simultaneous

mutations may explain the blood filtering problems in IgA

Nephropathy or the affection of blood vessels in Kawasaki

Disease or Lupus. The top candidates in the pink com-

munity are links between HLA-DQA2, HLA-DQB1 (T-cell

multiplication) and LPP (cell adhesion and motility), which

could explain chronic inflammation, and the link between

C1QTNF6 (regulation of immune cells) and IL2 (regulation

of Th17 cells), which could explain the high levels of Th17

cells in the affected tissues of RA and CD patients. The

top candidates edges in the green community are TMEM17-

COG6, COG6-B3GNT2 and B3GNT2-PARK7. Genetic vari-

ants in TMEM17, COG6 and B3GNT2 would directly impact

protein transport and their interaction with PARK7 could

produce aberrations in the proper protection against cell

stress. For the grass community, the top candidate links

are the interactions of ADPRH, ATP8B1 and CASP7 with

C14orf181. The latter is an uncharacterized gene that has

169

05001000150020002500

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Remaining links

Avg

. Pre

cisi

on

wCN wAA wPA wACDD wCAR Random

010002000300040005000

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Remaining links

020004000600080001000012000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Remaining links

05001000150020002500

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Remaining links

0200400600

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Remaining links

Diabetic retinopathy, IgA nephropathy,

Kawasaki disease,Systemic lupus erythematosus

Celiac disease,Rheumatoid arthritis

Ankylosing spondylitis,Crohn’s disease,

Primary sclerosing cholangitis,Ulcerative colitis

Type I diabetes, Vitiligo

AIDS,Hypothyroidism,

Psoriasis

Fig. 4. Performance evaluation of classical and proposed weighted link predictors.

been recently removed from several databases due to the

presence of a more interesting gene in the same chromosome.

However, not only was it part of our top 3 candidates but

also of the top 10. Given the association of ATP8B1 with

bile disorders and the high expression of CASP7 in liver,

it would be interesting to put more attention to C14orf181.

Finally, the set of top candidate interactions for the magenta

community are CAST-CCR5, CCR5-ANXA6 and ANXA6-

CCRL2, all involved in Calcium (Ca) regulation and intra-

cellular mobilization. This means that mutations in this set

of genes may impact signaling by Ca ions and explain Ca-

associated problems in AIDS or Psoriasis. Ca-regulation by

the thyroid gland could also be affected.

V. CONCLUSIONS

We analyzed GWAS data from a systems perspective,

which deems autoimmune disease genes not as isolated

entities but as important parts of a whole. The value of

the disease modules detected by Modularity Optimization

and the precision of the link predictors introduced, indicate

that they are powerful and trustworthy tools. We encourage

researchers to integrate Systems Biology and Network Sci-

ence into their projects because the knowledge generated by

these scientific devices could help us untangle the intricate

relationships between genotype and phenotype.

ACKNOWLEDGMENT

The authors thank all members of the Integrative Systems

Biology Lab at KAUST for their support, especially Maria

Contreras for proofreading the manuscript.

REFERENCES

[1] The Wellcome Trust Case Control Consortium, “Genome-wide asso-ciation study of 14,000 cases of seven common diseases and 3,000shared controls.” Nature, vol. 447, no. 7145, pp. 661–78, 2007.

[2] J. S. Bloom, I. M. Ehrenreich, W. T. Loo, T.-L. V. o. Lite, andL. Kruglyak, “Finding the sources of missing heritability in a yeastcross.” Nature, vol. 494, no. 7436, pp. 234–237, 2013.

[3] B. A. McKinney and N. M. Pajewski, “Six Degrees of Epistasis:Statistical Network Models for GWAS.” Frontiers in Genetics, vol. 2,no. January, pp. 1–6, 2011.

[4] A.-L. Barabasi, N. Gulbahce, and J. Loscalzo, “Network medicine: anetwork-based approach to human disease.” Nature Reviews. Genetics,vol. 12, no. 1, pp. 56–68, 2011.

[5] M. Vidal, M. E. Cusick, and A.-L. Barabasi, “Interactome networksand human disease.” Cell, vol. 144, no. 6, pp. 986–98, 2011.

[6] B. Lehner, “Genotype to phenotype: lessons from model organismsfor human genetics.” Nature Reviews. Genetics, vol. 14, no. 3, pp.168–178, 2013.

[7] K.-I. Goh, et al., “The human disease network.” Proceedings of theNational Academy of Sciences of the United States of America, vol.104, no. 21, pp. 8685–90, 2007.

[8] American Autoimmune Related Diseases Association, Inc. List ofAutoimmune and Autoimmune-Related Diseases. [Online]. Available:http://www.aarda.org

[9] L. Hindorff, et al. (2012) A Catalog of Pub-lished Genome-Wide Association Studies. [Online]. Available:http://www.genome.gov/gwastudies

[10] D. J. Watts and S. H. Strogatz, “Collective dynamics of ’small-world’networks.” Nature, vol. 393, no. 6684, pp. 440–2, 1998.

[11] A.-L. Barabasi and R. Albert, “Emergence of Scaling in RandomNetworks,” Science, vol. 286, no. October, pp. 509–512, 1999.

[12] C. V. Cannistraci, G. Alanis-Lobato, and T. Ravasi, “From link-prediction in brain connectomes and protein interactomes to the local-community-paradigm in complex networks,” Scientific Reports, vol. 3,pp. 1–13, 2013.

[13] T. Opsahl, “Triadic closure in two-mode networks: Redefining theglobal and local clustering coefficients,” Social Networks, vol. 35,no. 2, pp. 159–167, 2013.

[14] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fastunfolding of communities in large networks,” Journal of StatisticalMechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008,2008.

[15] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An open sourcesoftware for exploring and manipulating networks,” 2009.

[16] J. Wang, Z. Du, R. Payattakool, P. Yu, and C.-F. Chen, “A new methodto measure the semantic similarity of GO terms.” Bioinformatics,vol. 23, no. 10, pp. 1274–81, 2007.

[17] T. Obayashi, et al., “COXPRESdb: a database of coexpressed genenetworks in mammals.” Nucleic Acids Research, vol. 36, no. Databaseissue, pp. D77–82, 2008.

[18] M. J. H. Coenen, et al., “Common and different genetic background forrheumatoid arthritis and coeliac disease.” Human Molecular Genetics,vol. 18, no. 21, pp. 4195–203, 2009.

[19] M. R. Burge and J. D. Carey, “Vitiligo associated with subcutaneousinsulin lispro infusion in Type 1 Diabetes,” Diabetes Care, vol. 27,no. 1, pp. 275–276, 2004.

[20] T. Murata and S. Moriyasu, “Link Prediction of Social NetworksBased on Weighted Proximity Measures,” IEEE/WIC/ACM Interna-tional Conference on Web Intelligence (WI’07), pp. 85–88, 2007.

[21] G. Alanis-Lobato, C. V. Cannistraci, and T. Ravasi, “Exploitation ofgenetic interaction network topology for the prediction of epistaticbehavior,” Genomics, vol. 102, no. 4, pp. 202–208, 2013.

[22] D. W. Huang, B. T. Sherman, and R. Lempicki, “Systematic andintegrative analysis of large gene lists using DAVID bioinformaticsresources.” Nature Protocols, vol. 4, no. 1, pp. 44–57, 2009.

170

Documents

[IEEE 2014 Middle East Conference on Biomedical Engineering (MECBME) - Doha, Qatar (2014.02.17-2014.02.20)] 2nd Middle East Conference on Biomedical Engineering - Exploring the genetics