
[IEEE 2014 International Symposium on Consumer Electronics (ICSE) - JeJu Island, South Korea (2014.6.22-2014.6.25)] The 18th IEEE International Symposium on Consumer Electronics (ISCE


Developing Disease Risk Prediction Model Based on Environmental Factors

Mingyu Pak1

School of Electronics Engineering Kyungpook National University

Daegu, Korea [email protected]

Miyoung Shin2*

School of Electronics Engineering Kyungpook National University

Daegu, Korea [email protected]

Abstract— Analyzing the effects of environmental factors on human diseases is an important issue in recent bioinformatics studies. In this paper, we investigate several environmental factors related to Type-2 diabetes and select a subset of them to develop an analytical model for disease risk prediction. To select the significant factors, we first converted all the environmental factors into categorical values and then calculated the max/min odds ratio of each categorized factor. We then chose the top-n ranked factors as input features for the prediction model. The disease risk prediction model was developed with SVM classifiers, with training data built from the Ansan/Ansung Cohort 2 Data obtained from the Korean National Institute of Health (KNIH). The training data exhibited a class imbalance problem, which is common in practice. To handle it, we regenerated the training data using the SMOTE approach and used the result for disease risk prediction modeling. For model evaluation, the proposed method was applied to predict the risk of Type-2 diabetes. The experimental results showed that our SVM classifiers based on selected environmental factors produced results comparable to those of a prediction model based on genetic factors in forecasting the risk of a specific disease.

Keywords—Environment-wide association study; disease risk prediction; SVM classifiers

I. INTRODUCTION

Understanding and explaining the complex mechanisms of human disease is one of the fundamental challenges in recent genetic studies. According to earlier studies [1], the mechanisms underlying many human diseases can be affected by genetic or environmental factors. Although environmental factors play a significant role in disease-causing mechanisms, most work so far has focused on the effects of genetic factors such as SNP genotypes or gene expression. Recently, however, several studies [2,3] have shown that some environmental factors, along with genetic factors, may affect specific diseases. As a result, environment-wide association studies (EWAS) are attracting the attention of many research groups.

In this paper, our aim is to identify the environmental factors that are highly associated with the occurrence of a particular disease, Type-2 diabetes, and to construct a disease risk prediction model based on the selected environmental factors. For this purpose, we used the Ansung/Ansan Cohort 2 Data to develop a Type-2 diabetes risk prediction model, where Type-2 diabetes was defined following the standard procedure of the World Health Organization (WHO)**.

** “Definition, diagnosis and classification of diabetes mellitus and its complications. Part 1: Diagnosis and classification of diabetes mellitus,” Geneva, World Health Organization, 1999 (WHO/NCD/NCS/99.2).

II. METHODS

A. Experiment data

For the development of the disease risk prediction model, we used the Ansung/Ansan Cohort 2 Data obtained from the Korean National Institute of Health (KNIH). This dataset covers 8,843 individuals who live in the Ansung or Ansan area of Kyunggi province, Korea. Each individual's record consists of SNP genotypes, produced with an SNP chip (Affymetrix Human Mapping 500K), and 37 environmental factors, including gender, age, area, job, handedness, religion, income, plasma glucose concentration, relation to diagnosed family members, blood urea nitrogen (BUN), and creatinine. In this work, we are interested in the significance of the environmental factors associated with Type-2 diabetes, not in that of the genetic factors.

B. Data transformation and preprocessing

Factors such as checkup date, insulin treatment start date, insulin treatment end date, and age at diagnosis of family members were excluded, since this information is irrelevant to our study. The glucose measurements (plasma glucose concentration and the values taken 1 hour and 2 hours after a 75 g glucose drink) were also excluded, because they are used to define the disease and no-disease groups.

*Correspondence and requests for materials should be addressed to M. Shin ([email protected]).

IEEE ISCE 2014 1569946579

Prior to identifying the significant environmental factors for Type-2 diabetes, we transformed some of the environmental factors in our experiment data into categorical data. Specifically, each continuous-valued factor, such as albumin, age, and residence years, was divided into four ordinal values based on the 25%, 50%, 75%, and 100% percentiles. Because the marital state was recorded with additional detail, we merged its values into a single marital state factor; the same procedure was applied to religion. Other environmental factors, such as education, house type, and monthly income, were combined into a single socioeconomic status (SES) category, as in [4], which takes one of four ordinal values chosen in the same way as above. The creatinine and BUN factors were merged into a single BUN/creatinine ratio factor, as in [5], which also takes one of four ordinal values. Further, the family diagnosis factor was transformed into a binary value of 1 or 2, where 1 indicates that no family member has been diagnosed with Type-2 diabetes and 2 indicates that one or more family members have been diagnosed. After the data transformation step, 15 of the 37 environmental factors remained.
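The percentile-based categorization described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `quartile_bins`, the albumin readings, and the choice of the "inclusive" quantile method are hypothetical and are not taken from the cohort data or from the original implementation.

```python
from bisect import bisect_left
from statistics import quantiles


def quartile_bins(values):
    """Map each continuous value to an ordinal level 1..4 by sample quartiles."""
    # quantiles(..., n=4) returns the three cut points (25%, 50%, 75%).
    cuts = quantiles(values, n=4, method="inclusive")
    # bisect_left counts the cut points strictly below v, so a value equal
    # to a cut point is assigned to the lower bin.
    return [bisect_left(cuts, v) + 1 for v in values]


# Hypothetical albumin readings, for illustration only.
albumin = [3.9, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.8]
levels = quartile_bins(albumin)
print(levels)
```

The same helper could be applied to any other continuous factor, such as age or a derived BUN/creatinine ratio, once that factor has been computed per individual.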

Next, for each categorical value of every environmental factor, we calculated the odds ratio between the case (disease) and control (no-disease) groups and replaced the original categorical value with the corresponding odds ratio. A larger odds ratio indicates a higher risk of disease (Type-2 diabetes).
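A minimal sketch of this odds-ratio encoding is given below. The text does not state which reference group each odds ratio is computed against, so the sketch assumes a "category versus all other categories" comparison; the function name `category_odds_ratios` and the toy counts are hypothetical.

```python
from collections import Counter


def category_odds_ratios(categories, is_case):
    """Odds ratio of each category versus all other categories of one factor."""
    cases = Counter(c for c, d in zip(categories, is_case) if d)
    controls = Counter(c for c, d in zip(categories, is_case) if not d)
    total_cases = sum(cases.values())
    total_controls = sum(controls.values())
    ratios = {}
    for level in set(categories):
        case_in, control_in = cases[level], controls[level]
        case_out = total_cases - case_in
        control_out = total_controls - control_in
        # Assumption: no cell is zero; sparse data would need a correction.
        ratios[level] = (case_in * control_out) / (control_in * case_out)
    return ratios


# Toy factor: level 1 has 10 cases / 90 controls, level 2 has 20 / 30.
categories = [1] * 100 + [2] * 50
is_case = [True] * 10 + [False] * 90 + [True] * 20 + [False] * 30
ratios = category_odds_ratios(categories, is_case)

# Replace each categorical value with its odds ratio.
encoded = [ratios[c] for c in categories]
```

With these toy counts, level 2 has odds 20/30 against 10/90 for the rest, giving an odds ratio of 6, so individuals at level 2 would be encoded with the larger value.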

C. Selection of significant environmental factors

To identify the significant environmental factors, we computed, for each environmental factor, the ratio of its maximum odds ratio to its minimum odds ratio, and used this ratio to select the six most significant factors with a threshold value of 2.
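The selection rule above can be sketched as a ranking by max/min odds ratio followed by a cutoff. The sketch assumes that a factor is kept when its ratio is at least the threshold (the text does not say whether the comparison is strict); the function name `select_factors` and the odds ratios listed are hypothetical values, not results from the cohort.

```python
def select_factors(odds_ratios_by_factor, threshold=2.0, top_n=6):
    """Rank factors by max/min odds ratio and keep those at or above threshold."""
    spreads = {
        name: max(ratios) / min(ratios)
        for name, ratios in odds_ratios_by_factor.items()
    }
    ranked = sorted(spreads.items(), key=lambda item: item[1], reverse=True)
    return [name for name, spread in ranked if spread >= threshold][:top_n]


# Hypothetical per-category odds ratios for three factors.
example = {
    "family_history": [0.5, 3.0],      # max/min = 6.0
    "albumin": [0.8, 1.0, 1.2, 2.0],   # max/min = 2.5
    "handedness": [0.95, 1.05],        # max/min is about 1.1, dropped
}
selected = select_factors(example)
print(selected)
```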

D. Disease risk prediction modeling

For the 8,843 samples, we first defined the case (disease) group and the control (no-disease) group based on glucose level, according to the WHO standard for Type-2 diabetes. Specifically, if the plasma glucose concentration of a sample was higher than 126 mg/dl, we assigned it to the disease group; otherwise, we assigned it to the no-disease group [6]. As a result, we obtained 331 case samples and 8,512 control samples, which leads to a class imbalance problem when developing classification models. That is, because the training data contain many more control samples than case samples, the resulting classification models can be strongly biased toward the control class, so the sensitivity of the classifier becomes very low while the specificity remains very high. To handle this problem, we applied the synthetic minority over-sampling technique (SMOTE) [7] to regenerate the data. The disease risk prediction model was constructed by developing SVM classifiers, which were evaluated with 10-fold cross-validation.
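To illustrate the idea behind SMOTE, the following simplified sketch generates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbors. It is not the implementation used in this work: the function name `smote`, the parameters `n_new`, `k`, and `seed`, and the sample points are all hypothetical, and the base sample is drawn at random rather than by iterating over every minority sample as in the original algorithm. At least two minority samples are assumed.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbor interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base sample, excluding itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(
            tuple(b + gap * (o - b) for b, o in zip(base, other))
        )
    return synthetic


# Hypothetical case samples described by two odds-ratio-encoded features.
minority_points = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 1.8)]
synthetic_points = smote(minority_points, n_new=8, k=2)
```

Because each synthetic point lies on a segment between two existing case samples, the added samples stay inside the region already occupied by the minority class instead of duplicating individual records.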

III. RESULTS

A. Significant environmental factors

Figure 1 shows the significance of all the environmental factors in the Ansung/Ansan Cohort 2 Data for Type-2 diabetes. The occurrence of diagnosed family members is the most significant environmental factor for predicting the risk of Type-2 diabetes. Five other environmental factors (area, sex, SES, albumin, and BUN/creatinine ratio) are also somewhat associated with the occurrence of Type-2 diabetes. Thus, these six most significant environmental factors were used as input features for developing the SVM classifiers.

Fig. 1. Significance of environmental factors

B. Performance of disease risk prediction

Our disease risk prediction model with an SVM classifier showed 65.97% accuracy, % sensitivity, and % specificity, as shown in Table 2. Compared with the results of the earlier work in [8], which was performed on the same data, the performance of our model is comparable to that of their models based on genetic factors.
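For reference, the three measures reported above can be computed from confusion-matrix counts as in the sketch below. The function name `classification_metrics` is hypothetical. As an illustration of why accuracy alone is misleading on imbalanced data, the example uses the class counts from Section II-D (331 cases, 8,512 controls) with a trivial classifier that labels every sample as control; this baseline is an assumption for illustration, not a result of this work.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # fraction of cases detected
        "specificity": tn / (tn + fp),   # fraction of controls detected
    }


# Trivial baseline: every sample is predicted to be a control.
baseline = classification_metrics(tp=0, fn=331, tn=8512, fp=0)
print(baseline)
```

Such a baseline reaches an accuracy above 96% while detecting no cases at all, which is why sensitivity and specificity are reported alongside accuracy.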

IV. CONCLUSIONS

In this work, we investigated the effects of various environmental factors on Type-2 diabetes. The experimental results showed that carefully chosen environmental factors can be effective in predicting the risk of a specific disease such as Type-2 diabetes. In addition, the prediction model using only selected environmental factors showed performance comparable to that of a model using only genetic factors. In the near future, we plan to pursue an integrative approach that uses both genetic and environmental factors for disease risk prediction.

V. ACKNOWLEDGMENT

This work was supported by the IT R&D program of MISP/KEIT [10041145, Self-Organized Software platform (SoSp) for Welfare Devices].

VI. REFERENCES

[1] Murea M, Ma L, Freedman BI. “Genetic and environmental factors associated with type 2 diabetes and diabetic vascular complications,” Rev Diabet Stud. 2012 May 10;9(1):6-22.

[2] Hall MA, Dudek SM, Goodloe R, Crawford DC, Pendergrass SA, et al. “Environment-wide association study (EWAS) for type 2 diabetes in the Marshfield Personalized Medicine Research Project Biobank,” Pac Symp Biocomput. 2014:200-11.

[3] Patel CJ, Bhattacharya J, Butte AJ. “An environment-wide association study (EWAS) on type 2 diabetes mellitus,” PLoS One. 2010 May 20;5(5):e10746.

[4] Nolasco A, Quesada JA, Moncho J, Melchor I, Pereyra-Zamora P, et al. “Trends in socioeconomic inequalities in amenable mortality in urban areas of Spanish cities, 1996-2007,” BMC Public Health. 2014 Apr 1;14(1):299.

[5] Riccardi A, Chiarbonello B, Minuto P, Guiddo G, Corti L, Lerza R. “Identification of the hydration state in emergency patients: correlation between caval index and BUN/creatinine ratio,” Eur Rev Med Pharmacol Sci. 2013 Jul;17(13):1800-3.

[6] Chawla N, Bowyer K, Hall L, Kegelmeyer W. “SMOTE: synthetic minority over-sampling technique,” J Artif Intell Res. 2002;16:321-357.

[7] Ban HJ, Heo JY, Oh KS, Park KJ. “Identification of type 2 diabetes-associated combination of SNPs using support vector machine,” BMC Genet. 2010 Apr 23;11:26.
