LOGIT MODEL TO PREDICT DIABETES MELLITUS IN EMPLOYEE
Nurita Andayani1) and Moordiani2)
1) Statistics, Faculty of Pharmacy, Pancasila University
2) Pharmacology, Faculty of Pharmacy, Pancasila University
e-mail: [email protected]
Abstract. Diabetes mellitus is a metabolic disorder characterized by chronic hyperglycemia with disturbances of carbohydrate, fat and protein metabolism resulting from defects in insulin secretion, insulin action, or both. According to data from the WHO, more than 220 million people in the world suffer from diabetes mellitus, of whom 17 million are Indonesian (8.6% of Indonesia's population), which places Indonesia fourth in the world after India, China, and the USA. Patients with diabetes mellitus commonly experience fatigue, itchy skin, paresthesia, blurred vision, or other symptoms. If the disease progresses, it can lead to complications such as an increased risk of blindness, stroke, and death, and it can reduce the working productivity of company employees. Thus, it is important to identify the significant indicators for predicting the probability of this disease using logistic regression, or the logit model. From the data analysis, we obtained the logit model:
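$$\ln\frac{\pi(x)}{1-\pi(x)} = -3.925 + 0.718\,X_{\mathrm{sex}(1)} - 2.991\,X_{\mathrm{age}(1)} - 1.814\,X_{\mathrm{age}(2)} - 0.719\,X_{\mathrm{age}(3)} + 0.004\,X_{\mathrm{trigl}} + 0.057\,X_{\mathrm{BMI}}$$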
This model shows that diabetes in employees depends significantly on sex, age, triglyceride, and BMI (Body Mass Index, or IMT). From this model
Keywords: Regression, logit, diabetes
1 Introduction

Diabetes mellitus is a metabolic disorder characterized by chronic hyperglycemia with disturbances of carbohydrate, fat and protein metabolism resulting from defects in insulin secretion, insulin action, or both. The classification of diabetes is based on aetiological types. Type 1 indicates the processes of beta-cell destruction that may ultimately lead to diabetes in which insulin is required for survival. Type 2 diabetes is characterized by disorders of insulin action and/or insulin secretion. The third category, "other specific types of diabetes," includes diabetes caused by a specific and identified underlying defect, such as genetic defects or diseases of the exocrine pancreas. The latest WHO Global Burden of Disease estimates the worldwide burden of diabetes in adults to be around 173 million in the year 2002 (World Health Organization, 1999).
According to data from the WHO, more than 220 million people in the world suffer from diabetes mellitus, of whom 17 million are Indonesian (8.6% of Indonesia's population), which places Indonesia fourth in the world after India, China, and the USA. The diabetes epidemic is accelerating in the developing world, with an increasing proportion of affected people in younger age groups. Thus, we need to prevent and control the increase of this disease. The aim of this study is to predict the probability of this disease in company employees. However, this study should serve only as a prediction of the occurrence of diabetes based on the data we have.

Proceedings of the Third International Conference on Mathematics and Natural Sciences (ICMNS 2010)
Patients with diabetes commonly experience fatigue, itchy skin, paresthesia, blurred vision, or other symptoms. If the disease progresses, it can lead to complications such as an increased risk of blindness, stroke, and death. Diabetes in employees could reduce their working productivity, and the company will bear the burden if this continues. Generally, a company asks employee candidates to undergo a medical check-up that covers body weight and height, cholesterol, body mass index, and others. In this study, we want to predict the probability that someone will have diabetes in the future based on data such as sex, age, cholesterol (Low Density Lipoprotein or LDL, High Density Lipoprotein or HDL, and triglyceride), and Body Mass Index (BMI). These data were processed using logistic regression.
Logistic regression (sometimes called the logistic model or logit model) is a statistical method for describing the relationship between a response variable and explanatory variables. The response variable has only two values, for instance success or failure, live or die, acceptable or not. The explanatory variables can be either categorical or quantitative (Hosmer and Lemeshow). Logistic regression serves the same purpose as multiple regression analysis, but its response variable is a dummy variable (0 and 1). It is a modeling procedure that describes the relationship between a categorical response variable (Y) and one or more predictor variables (X), which may be categorical or continuous. For example, if the response variable has two categories, with Y = 1 as success and Y = 0 as failure, then binary logistic regression applies. For a single observation, a response with two categories means that y follows a Bernoulli distribution. The probability function of y with parameter $\pi$ is

$$P(Y = y) = \pi^{y}(1-\pi)^{1-y}$$

where y = 0 or 1. The probability of each category is then $P(Y = 1) = \pi$ and $P(Y = 0) = 1 - \pi$, where $E(y) = \pi$ and $0 \le \pi \le 1$. In general, the logistic regression model for this probability with several predictor variables (x) can be formulated as:
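$$\pi(x) = E(y \mid x) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}} \tag{1}$$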
$\pi(x)$ is a nonlinear function. Hence, we use the logit transformation to obtain a linear function, so that we can see the relationship between the response (dependent) variable (y) and the predictor (independent) variables (x). The logit form of $\pi(x)$, denoted $g(x)$, is:
$$g(x) = \ln\frac{\pi(x)}{1-\pi(x)} \tag{2}$$
Substituting equation (1) into equation (2) gives:
$$\ln\frac{\pi(x)}{1-\pi(x)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \tag{3}$$
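The relationship between equations (1) to (3) can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and the demonstration coefficients are hypothetical and chosen only to show that applying the logit to the logistic probability returns the linear predictor.

```python
import math

def linear_predictor(beta0, betas, xs):
    """Right-hand side of equation (3): beta0 + beta1*x1 + ... + betap*xp."""
    return beta0 + sum(b * x for b, x in zip(betas, xs))

def logistic(eta):
    """Equation (1): exp(eta) / (1 + exp(eta)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Equation (2): log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and predictor values, for illustration only.
eta = linear_predictor(-1.0, [0.5, 0.25], [2.0, 4.0])
p = logistic(eta)
print(f"linear predictor = {eta:.3f}, probability = {p:.3f}, logit(probability) = {logit(p):.3f}")
```

The probability always lies strictly between 0 and 1, while the logit is unbounded, which is why the linear model is fitted on the logit scale.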
2 Data

This research uses 2051 records taken from a private laboratory in Surabaya, Indonesia. The respondents were chosen from people who work as employees at their companies. The response variable is diabetes status (DM = 1 for diabetes, DM = 0 for non-diabetes). The predictor variables are sex (female = 1 and male = 0), age range (less than or equal to 30 years old = 1, 31-40 years old = 2, 41-50 years old = 3, and more than 50 years old = 4), total cholesterol, LDL, triglyceride, and body mass index (BMI). The results of the binary logistic regression obtained with SPSS are given below:
Table 1. Number of cases in model
All 2051 cases were included in the analysis, with no missing cases (Table 1). Table 2 shows the dependent variable encoding, where the internal value for non DM (non-diabetes) is 0 and for DM (diabetes) is 1. The Nagelkerke R Square shows that about 23.2% of the variation in the outcome variable (DM) is explained by this logistic model.
Table 2. Predicted outcome coding
Table 3. Categorical variables coding
Case Processing Summary (Table 1)

| Unweighted Cases (a) | | N | Percent |
|---|---|---|---|
| Selected Cases | Included in Analysis | 2051 | 100.0 |
| | Missing Cases | 0 | .0 |
| | Total | 2051 | 100.0 |
| Unselected Cases | | 0 | .0 |
| Total | | 2051 | 100.0 |

a. If weight is in effect, see classification table for the total number of cases.
Dependent Variable Encoding (Table 2)

| Original Value | Internal Value |
|---|---|
| non DM | 0 |
| DM | 1 |
Table 4. Amount of variation explained by the model
Table 5. Model discrimination
The Wald estimates in Table 6 indicate the importance of the contribution of each variable in the model: the higher the value, the more important the variable. For prediction purposes, sex, age, triglyceride, and BMI are important risk factors for diabetes (DM), with p-values of 0.018, <0.001, <0.001, and 0.013, all below the 0.05 significance level. Because LDL and HDL do not significantly predict diabetes (p = 0.543 and 0.472), they are omitted from the model, even though no multicollinearity is shown.
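For a single coefficient, the Wald statistic is conventionally the squared ratio of the estimate to its standard error, compared against a chi-square distribution with 1 degree of freedom. The sketch below is an illustration under that standard definition, not the SPSS implementation; the helper names are hypothetical, and the inputs are the rounded values printed in the Variables in the Equation output for the first model (Table 6), so recomputed Wald values differ slightly from the printed ones.

```python
import math

def wald_statistic(b, se):
    """Wald chi-square for one coefficient: (B / S.E.) squared."""
    return (b / se) ** 2

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Wald values as printed in Table 6 (first model, LDL and HDL still included).
printed_wald = {"SEX(1)": 5.645, "BMI": 6.168, "LDL": 0.370, "HDL": 0.518}
p_values = {name: chi2_sf_1df(w) for name, w in printed_wald.items()}
for name, p in p_values.items():
    print(f"{name}: Wald = {printed_wald[name]:.3f}, p = {p:.3f}")

# Recomputing Wald from the rounded B and S.E. for SEX(1) gives a nearby value.
wald_sex_from_rounded = wald_statistic(0.628, 0.265)
print(f"SEX(1) Wald from rounded B and S.E. = {wald_sex_from_rounded:.3f}")
```

The p-values for LDL and HDL stay well above 0.05, which is the basis for dropping them from the model.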
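Categorical Variables Codings (Table 3)

| Variable | Category | Frequency | Parameter coding (1) | Parameter coding (2) | Parameter coding (3) |
|---|---|---|---|---|---|
| age range | | 572 | 1.000 | .000 | .000 |
| | | 592 | .000 | 1.000 | .000 |
| | | 630 | .000 | .000 | 1.000 |
| | | 257 | .000 | .000 | .000 |
| Sex | male | 1258 | 1.000 | | |
| | female | 793 | .000 | | |

Legible age-range label: 50 tahun (50 years).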
Model Summary (Table 4)

| Step | -2 Log likelihood | Cox & Snell R Square | Nagelkerke R Square |
|---|---|---|---|
| 1 | 998.098 | .104 | .232 |
Classification Table (a) (Table 5)

| Step 1, Observed | Predicted DM: non DM | Predicted DM: DM | Percentage Correct |
|---|---|---|---|
| DM: non DM | 1865 | 5 | 99.7 |
| DM: DM | 174 | 7 | 3.9 |
| Overall Percentage | | | 91.3 |

a. The cut value is .500.
Table 6. Estimates of the logistic regression model
Table 7. Correlation matrix for Diabetes model
The new model (Table 10) contains no non-significant variables; all p-values (Sig.) are below 5%. Exp(B) gives the odds ratios. Since triglyceride is a quantitative variable, a one-unit increase in triglyceride raises the odds of having diabetes by 0.4%; this 0.4% is obtained as Exp(B) for triglyceride minus 1. Males have 2.051 times (95% CI 1.283 to 3.278) the odds of diabetes of females. For age range, 31-40 years old compared with 30 years old or younger gives an odds ratio of 0.05 (95% CI 0.021 to 0.119); 41-50 years old compared with 31-40 years old and 30 years old or younger gives 0.163 (95% CI 0.099 to 0.268); and more than 50 years old compared with the other ages gives 0.487 (95% CI 0.334 to 0.710), in each case meaning lower odds of diabetes. A one-unit increase in body mass index (BMI) raises the odds of having diabetes by 5.8%. This result differs somewhat from general research, which shows that increasing age may increase the risk of developing diabetes, but other factors such as diet, lifestyle, or genetic properties also need to be considered. In addition, the data were taken from different persons, which might yield a diabetes probability model that differs from the general theory.
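The odds ratios and confidence limits quoted from Table 10 follow from the coefficients by exponentiation. The sketch below assumes the usual Wald-type interval, exp(B ± z × S.E.) with z ≈ 1.96; the function names are hypothetical, and because the inputs are the rounded B and S.E. from Table 10, the results agree with the printed values only to about two decimals.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def odds_ratio_ci(b, se, z=Z_95):
    """Return (odds ratio, lower limit, upper limit) for one coefficient."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

def percent_change_in_odds(b):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (math.exp(b) - 1.0) * 100.0

# Rounded B and S.E. for SEX(1) as printed in Table 10.
or_sex, lower_sex, upper_sex = odds_ratio_ci(0.718, 0.239)
print(f"SEX(1): OR = {or_sex:.3f}, 95% CI {lower_sex:.3f} to {upper_sex:.3f}")

pct_trigl = percent_change_in_odds(0.004)
pct_bmi = percent_change_in_odds(0.057)
print(f"triglyceride: {pct_trigl:.1f}% per unit, BMI: {pct_bmi:.1f}% per unit")
```

An interval that excludes 1 corresponds to a coefficient that is significant at the 5% level, which matches the Sig. column for every variable in Table 10.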
Table 8. Amount of variation explained by the model after HDL and LDL omitted
Variables in the Equation (Table 6)

| Step 1 (a) | B | S.E. | Wald | df | Sig. | Exp(B) | 95.0% C.I. for EXP(B) Lower | 95.0% C.I. for EXP(B) Upper |
|---|---|---|---|---|---|---|---|---|
| SEX(1) | .628 | .265 | 5.645 | 1 | .018 | 1.875 | 1.116 | 3.148 |
| AGE | | | 78.465 | 3 | .000 | | | |
| AGE(1) | -2.972 | .443 | 44.932 | 1 | .000 | .051 | .021 | .122 |
| AGE(2) | -1.809 | .256 | 49.994 | 1 | .000 | .164 | .099 | .271 |
| AGE(3) | -.715 | .193 | 13.751 | 1 | .000 | .489 | .335 | .714 |
| BMI | .054 | .022 | 6.168 | 1 | .013 | 1.056 | 1.012 | 1.102 |
| LDL | .001 | .002 | .370 | 1 | .543 | 1.001 | .997 | 1.006 |
| HDL | -.008 | .011 | .518 | 1 | .472 | .992 | .972 | 1.013 |
| TRIGL | .004 | .001 | 25.166 | 1 | .000 | 1.004 | 1.002 | 1.005 |
| Constant | -3.625 | .936 | 14.983 | 1 | .000 | .027 | | |

a. Variable(s) entered on step 1: SEX, AGE, BMI, LDL, HDL, TRIGL.
Correlation Matrix (Table 7)

| Step 1 | Constant | SEX(1) | AGE(1) | AGE(2) | AGE(3) | BMI | LDL | HDL | TRIGL |
|---|---|---|---|---|---|---|---|---|---|
| Constant | 1.000 | -.430 | -.127 | -.154 | -.105 | -.636 | -.258 | -.685 | -.358 |
| SEX(1) | -.430 | 1.000 | .090 | .093 | .040 | .024 | -.102 | .423 | -.020 |
| AGE(1) | -.127 | .090 | 1.000 | .215 | .269 | -.008 | .106 | .023 | .041 |
| AGE(2) | -.154 | .093 | .215 | 1.000 | .463 | -.044 | .102 | .057 | .001 |
| AGE(3) | -.105 | .040 | .269 | .463 | 1.000 | -.090 | .055 | .012 | .003 |
| BMI | -.636 | .024 | -.008 | -.044 | -.090 | 1.000 | -.055 | .113 | -.041 |
| LDL | -.258 | -.102 | .106 | .102 | .055 | -.055 | 1.000 | -.116 | .076 |
| HDL | -.685 | .423 | .023 | .057 | .012 | .113 | -.116 | 1.000 | .403 |
| TRIGL | -.358 | -.020 | .041 | .001 | .003 | -.041 | .076 | .403 | 1.000 |
Model Summary (Table 8)

| Step | -2 Log likelihood | Cox & Snell R Square | Nagelkerke R Square |
|---|---|---|---|
| 1 | 998.903 | .104 | .232 |
The correlations (Table 11) among sex, age, triglyceride, and BMI are low, but the correlation between BMI and the constant is rather high (r = -0.894), which indicates some multicollinearity. Our recommendation is to keep the constant term in the model, as it acts as a garbage bin that collects all unexplained variance in the model (recall from Table 8 that the variables explain only 23.2% of the variation).
Table 9. Model discrimination after LDL and HDL omitted
Table 10. Estimates of the logistic regression model after LDL and HDL omitted
Table 11. Correlation matrix for Diabetes model after LDL and HDL omitted
Table 12. Hosmer-Lemeshow test
Classification Table (a) (Table 9)

| Step 1, Observed | Predicted DM: non DM | Predicted DM: DM | Percentage Correct |
|---|---|---|---|
| DM: non DM | 1864 | 6 | 99.7 |
| DM: DM | 174 | 7 | 3.9 |
| Overall Percentage | | | 91.2 |

a. The cut value is .500.
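Variables in the Equation (Table 10)

| Step 1 (a) | B | S.E. | Wald | df | Sig. | Exp(B) | 95.0% C.I. for EXP(B) Lower | 95.0% C.I. for EXP(B) Upper |
|---|---|---|---|---|---|---|---|---|
| SEX(1) | .718 | .239 | 9.015 | 1 | .003 | 2.051 | 1.283 | 3.278 |
| AGE | | | 80.973 | 3 | .000 | | | |
| AGE(1) | -2.991 | .441 | 46.093 | 1 | .000 | .050 | .021 | .119 |
| AGE(2) | -1.814 | .254 | 51.014 | 1 | .000 | .163 | .099 | .268 |
| AGE(3) | -.719 | .192 | 13.976 | 1 | .000 | .487 | .334 | .710 |
| BMI | .057 | .022 | 6.881 | 1 | .009 | 1.058 | 1.014 | 1.104 |
| TRIGL | .004 | .001 | 33.504 | 1 | .000 | 1.004 | 1.002 | 1.005 |
| Constant | -3.925 | .600 | 42.821 | 1 | .000 | .020 | | |

a. Variable(s) entered on step 1: SEX, AGE, BMI, TRIGL.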
Correlation Matrix (Table 11)

| Step 1 | Constant | SEX(1) | AGE(1) | AGE(2) | AGE(3) | BMI | TRIGL |
|---|---|---|---|---|---|---|---|
| Constant | 1.000 | -.271 | -.119 | -.125 | -.124 | -.894 | -.063 |
| SEX(1) | -.271 | 1.000 | .096 | .084 | .042 | -.031 | -.226 |
| AGE(1) | -.119 | .096 | 1.000 | .204 | .265 | -.003 | .021 |
| AGE(2) | -.125 | .084 | .204 | 1.000 | .460 | -.043 | -.041 |
| AGE(3) | -.124 | .042 | .265 | .460 | 1.000 | -.087 | -.009 |
| BMI | -.894 | -.031 | -.003 | -.043 | -.087 | 1.000 | -.094 |
| TRIGL | -.063 | -.226 | .021 | -.041 | -.009 | -.094 | 1.000 |
Hosmer and Lemeshow Test (Table 12)

| Step | Chi-square | df | Sig. |
|---|---|---|---|
| 1 | 5.139 | 8 | .743 |
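The significance value in the Hosmer and Lemeshow Test output is the upper-tail probability of a chi-square distribution evaluated at the reported statistic. As a hedged check using only the standard library, the sketch below relies on the closed-form tail for an even number of degrees of freedom; the function name is hypothetical, and the inputs are the printed chi-square (5.139) and df (8).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability for even df, via the closed-form
    sum exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term = 1.0   # i = 0 term of the sum
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

p_hl = chi2_sf_even_df(5.139, 8)
print(f"Hosmer-Lemeshow p-value = {p_hl:.3f}")
```

A value far above 0.05 means the test gives no evidence against the fit of the model.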
The Hosmer-Lemeshow goodness-of-fit test tells us how closely the observed and predicted probabilities match. The null hypothesis is that the model fits, so a p-value > 0.05 is expected (Table 12); here p = 0.743. The overall accuracy of this model in predicting subjects having diabetes (with a predicted probability of 0.5 or greater) is 91.2% (Table 9). The sensitivity is 3.9% and the specificity is 99.7%. The positive predictive value (PPV) = 7/13 = 53.8% and the negative predictive value (NPV) = 1864/2038 = 91.5%.
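These summary measures can be recomputed from the four cell counts of the classification table for the reduced model (Table 9). The sketch below uses the standard definitions; the function and key names are hypothetical. Note that the ratio 7/13 evaluates to about 53.8% and 1864/2038 to about 91.5%.

```python
def classification_metrics(tn, fp, fn, tp):
    """Standard summary measures from a 2x2 classification table."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # observed DM correctly predicted
        "specificity": tn / (tn + fp),   # observed non DM correctly predicted
        "ppv": tp / (tp + fp),           # predicted DM that are truly DM
        "npv": tn / (tn + fn),           # predicted non DM that are truly non DM
    }

# Cell counts from Table 9 (cut value 0.5).
metrics = classification_metrics(tn=1864, fp=6, fn=174, tp=7)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

The very low sensitivity shows that, at the 0.5 cut value, the model rarely classifies anyone as diabetic, so the high overall accuracy mostly reflects the large non DM group.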
For example, a male, 41 years old, with a triglyceride level of 167 and a BMI of 30.4 has Probability(diabetes) = 0.068; it is very unlikely that this subject has diabetes, and the NPV tells us that we are 91.5% confident. As another example, a male, 30 years old, with a triglyceride level of 500 and a BMI of 33.7 has Probability(diabetes) = 0.68; it is likely that this subject has diabetes, and the PPV gives 53.8% confidence.
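The two worked examples can be reproduced by inserting the Table 10 coefficients into the logit model and applying the logistic function. This is a sketch under explicit assumptions rather than the authors' procedure: the function names are hypothetical, male is taken as SEX(1) = 1, and the age dummies are assigned so that 30 years or younger is the baseline, 31-40 sets AGE(1), 41-50 sets AGE(2), and over 50 sets AGE(3). That assignment is an assumption chosen because it reproduces the quoted probability of 0.068. With the rounded coefficients, the second example evaluates to about 0.67, close to the quoted 0.68.

```python
import math

# Rounded coefficients from Table 10.
COEF = {
    "constant": -3.925, "sex(1)": 0.718,
    "age(1)": -2.991, "age(2)": -1.814, "age(3)": -0.719,
    "trigl": 0.004, "bmi": 0.057,
}

def age_dummies(age):
    """ASSUMED mapping of age in years to (AGE(1), AGE(2), AGE(3))."""
    if age <= 30:
        return (0, 0, 0)
    if age <= 40:
        return (1, 0, 0)
    if age <= 50:
        return (0, 1, 0)
    return (0, 0, 1)

def predict_probability(male, age, trigl, bmi):
    """Predicted probability of diabetes from the fitted logit model."""
    a1, a2, a3 = age_dummies(age)
    eta = (COEF["constant"] + COEF["sex(1)"] * (1 if male else 0)
           + COEF["age(1)"] * a1 + COEF["age(2)"] * a2 + COEF["age(3)"] * a3
           + COEF["trigl"] * trigl + COEF["bmi"] * bmi)
    return 1.0 / (1.0 + math.exp(-eta))

p_first = predict_probability(male=True, age=41, trigl=167, bmi=30.4)
p_second = predict_probability(male=True, age=30, trigl=500, bmi=33.7)
print(f"first example:  {p_first:.3f}")
print(f"second example: {p_second:.3f}")
```

Only the second subject exceeds the 0.5 cut value used in the classification table, so only that subject would be classified as DM.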
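From the data analysis, we obtained the logit model:

$$\ln\frac{\pi(x)}{1-\pi(x)} = -3.925 + 0.718\,X_{\mathrm{sex}(1)} - 2.991\,X_{\mathrm{age}(1)} - 1.814\,X_{\mathrm{age}(2)} - 0.719\,X_{\mathrm{age}(3)} + 0.004\,X_{\mathrm{trigl}} + 0.057\,X_{\mathrm{BMI}}$$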
This model shows that diabetes in employees depends significantly on sex, age, triglyceride, and BMI (Body Mass Index, or IMT).
References
Agresti, A., 1990, Categorical Data Analysis, John Wiley and Sons, Inc., New York.
Al-khazrajy, LA., Raheem, YA. & Hanoon, YK., 2010, Sex Differences in the Impact of Body Mass Index (BMI) and Waist/Hip (W/H) Ratio on Patients with Metabolic Risk Factors in Baghdad. Global Journal of Health Science, Vol. 2, No. 2.
Brunham, LR., Kruit, JK., Verchere, CB., and Hayden, MR., 2008, Cholesterol in Islet Dysfunction and Type 2 Diabetes, The Journal of Clinical Investigation, Volume 118 Number 2.
Chan, YH., 2004, Biostatistics 202: Logistic regression analysis, Singapore Med Journal, Vol. 45(4): 149.
Federal Bureau of Prisons, 2009, Management of Diabetes, Clinical Practice Guideline.
Friel, C.M., 1998, Probit/Logit Analysis, Criminal Justice Center, Sam Houston State University.
Hao, M., Head, WS., Gunawardana, SC., Hasty, AH., and Piston, DW., 2007, Direct Effect of Cholesterol on Insulin Secretion, A Novel Mechanism for Pancreatic β-Cell Dysfunction. Diabetes, Vol. 56.
Hosmer, DW. and Lemeshow, S., Applied Logistic Regression. [online] Available at: [Accessed 23 October 2010].
Laakso, M., Sarlund, H., Ehnholm, C., Voutilainen, E., Aro, A., and Pyörälä, K., 1987, Relationship between postheparin plasma lipases and high-density lipoprotein cholesterol in different types of diabetes. Diabetologia, 30:703-706.
Scheffer, PG., Teerlink, T., and Heine, RJ., 2005, Clinical significance of the physicochemical properties of LDL in type 2 diabetes, Diabetologia, 48:808-816.
Poedjiati, SA., 2010, Perbandingan Ketepatan Model Logit dan Probit Untuk Memprediksi Munculnya Penyakit Hipertensi pada Karyawan Perusahaan. Prosiding Seminar Nasional Basic Science VII, vol. 4:212.
Vasisht, A.K., 2000, Logit and Probit Analysis, I.A.S.R.I., Library Avenue, New Delhi.
Wild, S., Roglic, G., Sicree, R., Green, A., and King, H., 2003, Global burden of diabetes mellitus in the year 2000. Global Burden of Disease, Geneva: WHO.
World Health Organization, 1999, Definition, Diagnosis and Classification of Diabetes Mellitus and its Complications. Report of a WHO Consultation. Geneva: World Health Organization.
World Health Organization, 2003, Screening for Type 2 Diabetes. Report of a World Health Organization and International Diabetes Federation meeting. Geneva: World Health Organization.
http://www.indiana.edu/~lceiub/PY206F05/Logistic.pdf