
[IEEE 2011 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) - Atlanta, GA, USA (2011.11.12-2011.11.15)] 2011 IEEE International Conference on Bioinformatics and


Employing Weighted Biological Network Structure for Finding Disease Genetic Markers in SNP Association Studies

Jaeyoung Kim and Miyoung Shin

Bio-Intelligence & Data-Mining Laboratory

Graduate School of EECS, Kyungpook National University

1370 Sankyuk-dong, Buk-gu, Daegu, Korea 702-701

E-mail: [email protected], [email protected]

Abstract—SNP association studies are widely performed to find disease-related genetic markers, usually by investigating the difference in SNP genotype frequencies between disease and non-disease samples and evaluating its statistical significance. However, this approach often produces tie scores over multiple SNPs, especially when the number of samples is not large enough. In addition, to find genetic markers related to complex diseases such as cancer, various environmental or functional factors need to be taken into consideration. To deal with these problems, we examine a new analytical framework for identifying disease-related genetic markers that can incorporate a variety of a priori known gene functional factors into SNP association study. Specifically, using any biological resources of interest, we build a gene network structure and give high ranks to SNPs associated with hub genes (or core genes), which have high connectivity to many other genes, in determining disease markers. For experiments, we constructed network structures with gene ontology annotations and cancer modules, respectively, and identified genetic markers related to prostate cancer and non-Hodgkin lymphoma. The results demonstrated that the use of network structure constructed from various available biological resources can lead to better finding of disease-related genetic markers in SNP association studies.†

Keywords- SNP association study; SNP genotype data; disease-related genetic markers

I. INTRODUCTION

SNP association study, which aims to find disease-related genetic markers by examining SNP genotype data of disease and non-disease samples, is one of the primary challenges in recent bioinformatics research [1, 2]. Reportedly, approximately 80% of currently known disease markers are related to SNPs [3]. However, although SNP genotype is one of the most important factors determining disease phenotypes, SNP genotype data alone do not seem to be enough, especially for identifying genetic markers related to complex diseases like cancer. This is because a change of allele(s) caused by environmental factors often causes the mutation of a gene, which affects the proper functioning of that gene or its associated genes. In particular, a change of gene interaction(s) might lead to the development of complex diseases like cancer. Thus, it seems desirable to investigate additional information that carries a variety of environmental or functional factors associated with specific diseases.

† To whom correspondence should be addressed.

For association studies, SNP genotype data have largely been produced by various SNP chip platforms [1]. In such cases, SNP genotypes (e.g., AA, AB, or BB) are called from the intensity values of probes on a chip, and disease genetic markers are found by exploring the difference in SNP genotype data between case and control samples. Conventionally, most approaches investigate only the difference in genotype frequencies between disease and non-disease group samples, although some approaches (e.g., the Cochran-Armitage test [4] or the MAX test [5]) consider genetic models along with genotype frequencies. Because they rely on genotype frequencies alone, these approaches often produce tie scores over multiple SNPs, especially when the sample size is not large enough.

In this paper, to handle these issues, we examine a new approach to finding disease genetic markers that can integrate a variety of biological resources into SNP association study, which in turn helps to resolve the tie group problem. The paper is organized as follows. Section 2 details our proposed approach to SNP association study, and Section 3 evaluates its performance on two experimental datasets and shows the results. Finally, Section 4 presents concluding remarks with some discussion.

II. OUR APPROACH TO SNP ASSOCIATION STUDY

As mentioned earlier, many of the common approaches to SNP association study explore only the difference in genotype data between disease and non-disease samples, so they tend to produce many tie scores over candidate SNPs. To deal with this problem, we attempt to incorporate several biological resources into SNP association study within a modified GeneRank framework. Specifically, we employ the gene-level score vector obtained from gene-associated SNP scores and the weighted gene network structure derived from several biological resources [6, 7] to prioritize candidate genetic markers related to a certain disease. In this way, our approach can identify disease genetic markers by taking into consideration both the difference in SNP genotype frequencies between case and control samples and various biological resources.

2011 IEEE International Conference on Bioinformatics and Biomedicine

978-0-7695-4574-5/11 $26.00 © 2011 IEEE

DOI 10.1109/BIBM.2011.69


A. Modified GeneRank Algorithm

For the prioritization of candidate genetic markers associated with diseases, we employ a modified version of the GeneRank algorithm, which was originally proposed for DNA microarray data analysis [8]. In the modified GeneRank algorithm, the ranking score is calculated from gene-associated SNP scores and gene network structure, based on the following formula:

$$(I - dW^{T}D^{-1})\,r = (1-d)\,S \qquad (2)$$

where I denotes the n × n identity matrix (n: the total number of genes related to all SNPs), W denotes the n × n adjacency matrix whose entry w_ij is the weight of the link between gene i and gene j, and S denotes the n × 1 gene-associated SNP score vector whose i-th component is the maximum of the normalized SNP scores associated with gene i. Here a normalized score is obtained by dividing each SNP score by the maximum of the absolute values of all the SNP scores. Also, D is the n × n diagonal matrix whose entry d_ii contains the degree of gene i in the network structure represented by W, r is the output ranking score vector, and d is a control parameter in the range [0, 1] that adjusts the weight of the network structure relative to the gene-associated SNP scores in calculating the final ranking scores. As d gets closer to 1, more weight is given to the network structure than to the gene-associated SNP scores.
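As an illustration of the ranking step, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the degree d_ii is the row sum of W, that isolated genes are given a degree of 1 to avoid division by zero, and that d < 1 so that a simple fixed-point iteration of r = (1 - d) S + d W^T D^{-1} r converges; the paper does not say how the linear system was solved. The function names `gene_level_scores` and `generank`, the SNP and gene identifiers, and the toy network are all hypothetical.

```python
def gene_level_scores(snp_scores, snp_to_gene):
    """Build S: for each gene, the maximum normalized score of its SNPs.

    snp_scores  : dict mapping SNP id -> raw association score
    snp_to_gene : dict mapping SNP id -> gene symbol
    Assumes at least one SNP score is non-zero.
    """
    # Normalize by the largest absolute SNP score, as described in the text.
    max_abs = max(abs(s) for s in snp_scores.values())
    gene_scores = {}
    for snp, score in snp_scores.items():
        gene = snp_to_gene[snp]
        normalized = score / max_abs
        # Keep the maximum normalized score seen so far for this gene.
        gene_scores[gene] = max(gene_scores.get(gene, normalized), normalized)
    return gene_scores


def generank(W, S, d=0.5, tol=1e-12, max_iter=10000):
    """Approximate r in (I - d W^T D^-1) r = (1 - d) S by fixed-point iteration.

    W : n x n list of lists with link weights (0 where there is no link)
    S : list of n gene-level scores
    d : control parameter, assumed here to satisfy 0 <= d < 1
    """
    n = len(S)
    # Degree of each gene = row sum of W (assumption); isolated genes get 1.
    deg = [sum(row) or 1.0 for row in W]
    r = list(S)
    for _ in range(max_iter):
        # (W^T D^-1 r)_i = sum_j W[j][i] * r[j] / deg[j]
        new_r = [
            (1 - d) * S[i] + d * sum(W[j][i] * r[j] / deg[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    return r


# Toy example: gene 0 is a hub linked to genes 1 and 2; all three genes tie on S.
W = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
S = [0.5, 0.5, 0.5]
r = generank(W, S, d=0.5)
```

In this toy network the three genes start with identical scores, and the hub gene 0 ends up with a higher ranking score than genes 1 and 2, which is how the network term can break ties. With d = 0 the function simply returns S.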

B. Gene Network Construction

For use with the modified GeneRank algorithm, we construct two types of gene networks, with unweighted and weighted network structure, from publicly available resources such as gene ontology (GO) annotations [7] and cancer modules (CM) [6].

1) Unweighted Network Structure

From GO annotations, if any two genes share a certain biological function, a genetic link between them is defined. That is, two genes having the same functionality are considered to be related to each other. Similarly, a cancer module (CM) refers to a group of genes acting together to perform a specific function related to cancers, originally constructed from microarray gene expression studies. Thus, for network construction, if any two genes belong to the same CM, a genetic link between them is added to the network structure. This means that all member genes within the same module are considered to be related to each other. Figure 1 (a) and (b) show biological networks built from GO annotations and CMs, respectively.

2) Weighted Network Structure

The unweighted structure of gene networks often fails to carry sophisticated information about the links between genes. To overcome this problem, we employ a link-weighting scheme for biological network construction. Through this weighting scheme, the strength of the genetic links in the network structure can be flexibly defined.

Figure 1. Examples of constructing unweighted gene networks: (a) GO annotation based and (b) CM based network structures.

To build the weighted network structure, we consider (1) the number of biological functions shared between two genes, and (2) the number of cancer modules shared between two genes. This is based on the assumption that if a gene shares more functions or modules with another gene, the link between the two genes should be given additional weight. Specifically, the weight of the link between gene A and gene B, denoted by W(A,B), is given by the following formula:

$$W(A,B) = |F(A) \cap F(B)| \qquad (1)$$

where F(A) and F(B) denote the sets of functions (or modules) owned by gene A and gene B, respectively, and |.| denotes the cardinality of a set. Thus, the more functions or modules two genes share, the stronger the link between them.
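The two network types can be sketched as follows. This is an illustrative sketch only: the function name `build_network`, the gene symbols, and the annotation identifiers are hypothetical, and the input is assumed to be a mapping from each gene to the set of GO terms (or cancer modules) it belongs to. In the weighted case each link gets the weight of formula (1); in the unweighted case any shared term or module yields a link of weight 1.

```python
from itertools import combinations


def build_network(annotations, weighted=True):
    """Return (genes, W), where W[i][j] is the link weight between genes i and j.

    annotations : dict mapping gene symbol -> set of function/module ids
    weighted    : if True, W(A,B) = |F(A) & F(B)|; otherwise 1 if any id is shared
    """
    genes = sorted(annotations)
    index = {g: i for i, g in enumerate(genes)}
    n = len(genes)
    W = [[0] * n for _ in range(n)]
    for a, b in combinations(genes, 2):
        shared = len(annotations[a] & annotations[b])
        if shared:
            w = shared if weighted else 1
            # Links are undirected, so the matrix is kept symmetric.
            W[index[a]][index[b]] = w
            W[index[b]][index[a]] = w
    return genes, W


# Hypothetical annotations for three genes.
annotations = {
    "GENE_A": {"term1", "term2"},
    "GENE_B": {"term1", "term2", "term3"},
    "GENE_C": {"term3"},
}
genes, W_weighted = build_network(annotations, weighted=True)
_, W_unweighted = build_network(annotations, weighted=False)
```

Here GENE_A and GENE_B share two terms, so their link has weight 2 in the weighted network and 1 in the unweighted one, while GENE_A and GENE_C share nothing and remain unlinked in both.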

III. EXPERIMENTS AND RESULTS

To analyze the performance of our proposed method, we used the prostate cancer SNP chip data [9] produced by Castro et al. (2009) and the non-Hodgkin lymphoma SNP chip data [10] produced by Wojnowski et al. (2005) to identify disease gene markers. The identified disease gene markers were biologically validated by using a priori known prostate cancer related genes and non-Hodgkin lymphoma related genes as the gold standards, respectively.

Our experiments proceeded as follows. First, two kinds of SNP scores were obtained with the MAX test [5] and the ReliefF method [11], respectively. Second, unweighted (or weighted) gene networks were constructed from GO annotations and CMs, respectively. Then, the gene-associated SNP scores and the gene network were given as input to the modified GeneRank algorithm to prioritize candidate genetic markers. Finally, the evaluation was carried out by comparing the top n most significant genes with the true genetic markers in the gold standard. That is, we counted the number of true disease-related genes included in the top n ranked genes for different choices of the control parameter d in [0, 1], e.g., d=0.5 or 0.7.
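The top-n evaluation described above amounts to a simple count, sketched below under stated assumptions. The function name `count_hits`, the gene symbols, and the scores are hypothetical. Ties are broken by input order (Python's sort is stable), which is a choice made here and not something the paper specifies.

```python
def count_hits(scores, gold_standard, cutoffs=(100, 200, 300, 400, 500)):
    """Count gold-standard genes among the top-n ranked genes for each cutoff n.

    scores        : dict mapping gene symbol -> ranking score (higher is better)
    gold_standard : set of a priori known disease-related gene symbols
    """
    # Rank genes from highest to lowest score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {n: sum(1 for g in ranked[:n] if g in gold_standard) for n in cutoffs}


# Hypothetical ranking scores and gold standard.
scores = {"G1": 0.9, "G2": 0.8, "G3": 0.7, "G4": 0.1}
gold = {"G1", "G3", "G9"}
hits = count_hits(scores, gold, cutoffs=(1, 2, 3))
```

Each row of the result tables corresponds to one such count per cutoff, computed for a given scoring method, network, and d-value.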

A. Prostate Cancer Data Analysis

The prostate cancer SNP chip data [9] used in our experiments were originally obtained with the Affymetrix GeneChip 500K by Castro et al. (2009). This dataset consists of 20 prostate cancer patient samples and 20 normal samples, where each sample includes 500,568 SNP genotypes. By applying this dataset to SNP association study, we intended to evaluate the usefulness of unweighted (or weighted) network structure derived from several biological resources for finding disease genetic markers. The identified genetic markers were validated against 263 a priori known prostate cancer related gene symbols provided by SABiosciences [12].

Table 1 shows the results obtained by using biological process terms in GO annotations for network construction. From this table, we can observe that incorporating biological networks into SNP association study is effective in improving the performance of existing methods such as MAX and ReliefF. For example, with the weighted network and d=0.7, ReliefF+GO recovers 37 true markers among the top 500 genes, compared with 14 for ReliefF alone. Also, when the d-value gets larger (i.e., when more weight is given to the network structure), more disease-related genetic markers are found. Overall, the best result was obtained when the weighted network structure was used with the control parameter d=0.7.

Table 1. Comparison of the results in prostate cancer dataset when using GO annotations for network construction. Entries give the number of true disease genetic markers in the top n-ranked genes.

| Methods | n=100 | n=200 | n=300 | n=400 | n=500 |
|---|---|---|---|---|---|
| MAX | 3 | 7 | 8 | 9 | 13 |
| ReliefF | 3 | 7 | 8 | 9 | 14 |
| MAX + GO (d=0.5, Unweighted) | 3 | 5 | 7 | 12 | 14 |
| ReliefF + GO (d=0.5, Unweighted) | 5 | 5 | 9 | 13 | 24 |
| MAX + GO (d=0.5, Weighted) | 4 | 5 | 10 | 17 | 24 |
| ReliefF + GO (d=0.5, Weighted) | 6 | 8 | 15 | 24 | 28 |
| MAX + GO (d=0.7, Unweighted) | 5 | 5 | 7 | 18 | 23 |
| ReliefF + GO (d=0.7, Unweighted) | 6 | 10 | 15 | 24 | 28 |
| MAX + GO (d=0.7, Weighted) | 10 | 17 | 19 | 26 | 32 |
| ReliefF + GO (d=0.7, Weighted) | 15 | 22 | 24 | 28 | 37 |

Next, the results obtained by using CMs for network construction are shown in Table 2. The best result was obtained when the weighted network structure was used with the control parameter d=0.7. In addition, when the d-value gets larger, more disease genetic markers tend to be found.

Table 2. Comparison of the results in prostate cancer dataset when using cancer modules for network construction. Entries give the number of true disease genetic markers in the top n-ranked genes.

| Methods | n=100 | n=200 | n=300 | n=400 | n=500 |
|---|---|---|---|---|---|
| MAX | 3 | 7 | 8 | 9 | 13 |
| ReliefF | 3 | 7 | 8 | 9 | 14 |
| MAX + CM (d=0.5, Unweighted) | 3 | 5 | 6 | 13 | 15 |
| ReliefF + CM (d=0.5, Unweighted) | 3 | 5 | 11 | 15 | 21 |
| MAX + CM (d=0.5, Weighted) | 3 | 5 | 7 | 13 | 15 |
| ReliefF + CM (d=0.5, Weighted) | 3 | 6 | 12 | 16 | 18 |
| MAX + CM (d=0.7, Unweighted) | 4 | 8 | 12 | 13 | 15 |
| ReliefF + CM (d=0.7, Unweighted) | 5 | 11 | 13 | 17 | 20 |
| MAX + CM (d=0.7, Weighted) | 4 | 7 | 11 | 16 | 19 |
| ReliefF + CM (d=0.7, Weighted) | 5 | 11 | 17 | 18 | 22 |

From the above results, we infer that the use of several biological resources along with SNP genotype data for disease association study can improve the performance of finding disease genetic markers.

B. Non-Hodgkin Lymphoma Data Analysis

The non-Hodgkin lymphoma SNP chip data [10] used in our experiments were originally obtained with the Affymetrix GeneChip Human Mapping 10K Xba 131 Array by Wojnowski et al. (2005). This dataset consists of 52 non-Hodgkin lymphoma related samples and 52 anthracycline-chemotherapy treated samples, where each sample includes 11,555 SNP genotypes. The analysis results were validated against 119 a priori known non-Hodgkin lymphoma related gene symbols obtained from the Genetic Association Database (GAD) [13] as the gold standard.

Table 3 shows the results achieved by using biological process terms in GO annotations for network construction. This table indicates that the use of biological networks for SNP association study can be effective in improving the performance of existing methods such as MAX and ReliefF. Also, when using the weighted network structure and larger d-values up to a certain point, the finding of disease-related genetic markers could be slightly improved, although not as much as in the prostate cancer experiment.

Table 3. Comparison of the results in non-Hodgkin lymphoma dataset when using GO annotations for network construction. Entries give the number of true disease genetic markers in the top n-ranked genes.

| Methods | n=100 | n=200 | n=300 | n=400 | n=500 |
|---|---|---|---|---|---|
| MAX | 0 | 0 | 1 | 4 | 5 |
| ReliefF | 1 | 4 | 4 | 4 | 4 |
| MAX + GO (d=0.5, Unweighted) | 5 | 7 | 9 | 9 | 9 |
| ReliefF + GO (d=0.5, Unweighted) | 3 | 9 | 9 | 14 | 15 |
| MAX + GO (d=0.5, Weighted) | 5 | 7 | 10 | 11 | 13 |
| ReliefF + GO (d=0.5, Weighted) | 5 | 8 | 12 | 14 | 18 |
| MAX + GO (d=0.7, Unweighted) | 7 | 8 | 9 | 11 | 12 |
| ReliefF + GO (d=0.7, Unweighted) | 6 | 10 | 13 | 13 | 15 |
| MAX + GO (d=0.7, Weighted) | 5 | 11 | 13 | 15 | 16 |
| ReliefF + GO (d=0.7, Weighted) | 6 | 13 | 14 | 16 | 17 |


On the other hand, the results obtained by using CMs for network construction in the non-Hodgkin lymphoma dataset are shown in Table 4. Unlike the previous case with GO annotations, better performance was achieved overall when the unweighted network structure was used. This may be because the performance depends on the nature of the biological resources used for network construction.

Table 4. Comparison of the results in non-Hodgkin lymphoma dataset when using cancer modules for network construction. Entries give the number of true disease genetic markers in the top n-ranked genes.

| Methods | n=100 | n=200 | n=300 | n=400 | n=500 |
|---|---|---|---|---|---|
| MAX | 0 | 0 | 1 | 4 | 5 |
| ReliefF | 1 | 4 | 4 | 4 | 4 |
| MAX + CM (d=0.5, Unweighted) | 3 | 5 | 6 | 13 | 15 |
| ReliefF + CM (d=0.5, Unweighted) | 3 | 3 | 5 | 8 | 9 |
| MAX + CM (d=0.5, Weighted) | 3 | 6 | 6 | 6 | 7 |
| ReliefF + CM (d=0.5, Weighted) | 5 | 5 | 6 | 7 | 8 |
| MAX + CM (d=0.7, Unweighted) | 4 | 6 | 8 | 10 | 12 |
| ReliefF + CM (d=0.7, Unweighted) | 4 | 6 | 8 | 10 | 14 |
| MAX + CM (d=0.7, Weighted) | 5 | 6 | 7 | 8 | 9 |
| ReliefF + CM (d=0.7, Weighted) | 5 | 5 | 7 | 8 | 10 |

IV. DISCUSSION AND FUTURE WORKS

In this paper we presented a novel approach to SNP association study that employs gene network structure derived from a variety of biological resources, such as GO annotations and cancer modules, in addition to SNP genotype data. To evaluate the proposed method, we conducted experiments with two publicly available datasets. The results demonstrated that the use of biological network structure for SNP association study can be quite effective in finding disease genetic markers, while addressing the problem of producing many tie scores over candidate SNPs. In particular, the link-weighting scheme for network structure not only provides a means to adjust the strength of gene connectivity, but also provides a more sophisticated framework for incorporating a variety of biological resources into SNP association studies.

As future work, we plan to investigate the use of other biological resources, such as pathway databases and cytogenetic bands, for finding disease genetic markers. Other weighting schemes are also under consideration for constructing network structures that better reveal the gene interactions defined by the characteristics of the biological resources. Finally, the issue of choosing appropriate d-values and their effects on the identification of disease genetic markers will be explored further.

ACKNOWLEDGMENT

This work was supported by the Korea Research Foundation of Korea (KRF) grant funded by the Korea government (MEST) (No. 2009-0067724).

REFERENCES

[1] J. Hoh, A. Wille and J. Ott, "Trimming, weighting, and grouping SNPs in human case-control association studies," Genome Res., vol. 11(12), pp. 2115-2119, Dec. 2001.

[2] J. N. Hirschhorn and M. J. Daly, "Genome-wide association studies for common diseases and complex traits," Nat. Rev. Genet., vol. 6, pp. 95-108, 2005.

[3] B. Maher, "Personal genomes: The case of the missing heritability," Nature, vol. 456, pp. 18-21, Nov. 2008.

[4] B. Freidlin, G. Zheng, Z. Li and J. L. Gastwirth, "Trend tests for case-control studies of genetic markers: power, sample size and robustness," Human Heredity, vol. 53(3), pp. 146-152, 2002.

[5] Q. Li, K. Yu, Z. Li and G. Zheng, "MAX-rank: a simple and robust genome-wide scan for case-control association studies," Hum. Genet., vol. 123(6), pp. 617-623, May 2008.

[6] Molecular Signatures Database (MSigDB): http://www.broadinstitute.org/gsea/downloads.jsp

[7] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin and G. Sherlock, "Gene ontology: tool for the unification of biology," Nature Genetics, vol. 25(1), pp. 25-29, 2000.

[8] J. L. Morrison, R. Breitling, D. J. Higham and D. R. Gilbert, "GeneRank: using search engine technology for the analysis of microarray experiments," BMC Bioinformatics, vol. 6:233, doi:10.1186/1471-2105-6-233, 2005.

[9] P. Castro, C. J. Creighton, M. Ozen and D. Berel, "Genomic profiling of prostate cancers from African American men," Neoplasia, vol. 11(3), pp. 305-312, Mar. 2009.

[10] L. Wojnowski, B. Kulle, M. Schirmer, G. Schluter, A. Schmidt, A. Rosenberger, S. Vonhof, H. Bickeboller, M. R. Toliat, E. K. Suk, M. Tzvetkov, A. Kruger, S. Seifert, M. Kloess, H. Hahn, M. Loeffler, P. Nurnberg, M. Pfreundschuh, L. Trumper, J. Brockmoller and G. Hasenfuss, "NAD(P)H oxidase and multidrug resistance protein genetic polymorphisms are associated with doxorubicin-induced cardiotoxicity," Circulation, vol. 112, pp. 3754-3762, 2005.

[11] I. Kononenko, "Estimating attributes: analysis and extensions of RELIEF," European Conference on Machine Learning, pp. 171-182, 1994.

[12] SABiosciences Corporation Home Page, retrieved 2010, from http://www.sabiosciences.com/ArrayList.php

[13] The Genetic Association Database, from http://geneticassociationdb.nih.gov/

[14] W. Y. Wang, B. J. Barratt, D. G. Clayton and J. A. Todd, "Genome-wide association studies: theoretical and practical concerns," Nat. Rev. Genet., vol. 6, pp. 109-118, 2005.
