LOGIT MODEL TO PREDICT DIABETES MELLITUS IN EMPLOYEES

Nurita Andayani1) and Moordiani2)

1) Statistics, Faculty of Pharmacy, Pancasila University
2) Pharmacology, Faculty of Pharmacy, Pancasila University

e-mail: [email protected]

Abstract. Diabetes mellitus is a metabolic disorder characterized by chronic hyperglycemia with disturbances of carbohydrate, fat and protein metabolism resulting from defects in insulin secretion, insulin action, or both. According to WHO data, more than 220 million people worldwide suffer from diabetes mellitus, of whom 17 million are Indonesian (8.6% of Indonesia's population), which places Indonesia fourth in the world after India, China, and the USA. Patients with diabetes mellitus commonly experience fatigue, itchy skin, paresthesia, blurred vision, and other symptoms. As the disease progresses, it can lead to complications such as an increased risk of blindness, stroke, and death, and it can reduce the working productivity of company employees. It is therefore important to identify the significant indicators for predicting the probability of this disease using logistic regression (the logit model). From the data analysis, we obtained the logit model:

$$\ln\frac{\pi(x)}{1-\pi(x)} = -3.925 + 0.718\,X_{sex(1)} - 2.991\,X_{age(1)} - 1.814\,X_{age(2)} - 0.719\,X_{age(3)} + 0.004\,X_{trigl} + 0.057\,X_{BMI}$$

This model shows that diabetes in employees depends significantly on sex, age, triglyceride level, and BMI (Body Mass Index, or IMT).

Keywords: Regression, logit, diabetes

1 Introduction

Diabetes mellitus is a metabolic disorder characterized by chronic hyperglycemia with disturbances of carbohydrate, fat and protein metabolism resulting from defects in insulin secretion, insulin action, or both. The classification of diabetes is based on aetiological types. Type 1 indicates the processes of beta-cell destruction that may ultimately lead to diabetes in which insulin is required for survival. Type 2 diabetes is characterized by disorders of insulin action and/or insulin secretion. The third category, "other specific types of diabetes," includes diabetes caused by a specific and identified underlying defect, such as genetic defects or diseases of the exocrine pancreas. The latest WHO Global Burden of Disease estimates the worldwide burden of diabetes in adults to be around 173 million in the year 2002 (World Health Organization, 1999).

According to WHO data, more than 220 million people in the world suffer from diabetes mellitus, of whom 17 million are Indonesian

Proceedings of the Third International Conference on Mathematics and Natural Sciences (ICMNS 2010)



(8.6% of Indonesia's population), which places Indonesia fourth in the world after India, China, and the USA. The diabetes epidemic is accelerating in the developing world, with an increasing proportion of affected people in younger age groups. Thus, we need to prevent and control the spread of this disease. The aim of this study is to predict the probability of diabetes in company employees. However, this study should serve only as a prediction of the occurrence of diabetes based on the data we have.

Diabetes patients commonly experience fatigue, itchy skin, paresthesia, blurred vision, and other symptoms. As the disease progresses, it can lead to complications such as an increased risk of blindness, stroke, and death. Diabetes in employees can reduce their working productivity, and the burden falls on the company if this continues. Generally, a company asks employee candidates to undergo a medical check-up that covers body weight and height, cholesterol, body mass index, and other measurements. In this study, we want to predict the probability that someone will have diabetes in the future based on data such as sex, age, cholesterol (low-density lipoprotein or LDL, high-density lipoprotein or HDL, and triglyceride), and body mass index (BMI). These data were processed using logistic regression.

Logistic regression (sometimes called the logistic model or logit model) is a statistical method for describing the relationship between a response variable and one or more explanatory variables. The response variable has only two values, for instance success or failure, live or die, acceptable or not. The explanatory variables can be either categorical or quantitative (Hosmer and Lemeshow). Logistic regression plays the same role as multiple regression analysis, but its response variable is a dummy variable (0 or 1). It is a modeling procedure that describes the relationship between a categorical response variable (Y) and one or more predictor variables (X), which may be categorical or continuous. For example, if the response variable has two categories, with Y = 1 denoting success and Y = 0 denoting failure, then binary logistic regression applies. For a single research object, a response with two categories means that y follows a Bernoulli distribution. The probability function of y with parameter $\pi$ is

$$P(Y = y) = \pi^{y}(1-\pi)^{1-y}$$

where y = 0 or 1. The probability of each category is then $P(Y = 1) = \pi$ and $P(Y = 0) = 1 - \pi$, where $E(y) = \pi$ and $0 \le \pi \le 1$. In general, the logistic regression model for the probability involving several predictor variables (x) can be formulated as:

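$$\pi(x) = E(y \mid x) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}} \tag{1}$$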

$\pi(x)$ is a nonlinear function of x. Hence, we use the logit transformation to obtain a linear function, so that we can see the relationship between the response (dependent) variable y and the predictor (independent) variables x. The logit form of $\pi(x)$, denoted $g(x)$, is:

$$g(x) = \ln\frac{\pi(x)}{1-\pi(x)} \tag{2}$$



Substituting equation (1) into equation (2) gives:

$$\ln\frac{\pi(x)}{1-\pi(x)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \tag{3}$$
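The relationship between equations (1), (2) and (3) can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names `pi_of_x` and `logit`, the coefficients in `beta`, and the predictor values in `x` are all hypothetical and chosen only for illustration.

```python
import math

def pi_of_x(x, beta):
    """Equation (1): probability from the linear predictor beta_0 + sum(beta_j * x_j)."""
    eta = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
    return math.exp(eta) / (1.0 + math.exp(eta))

def logit(p):
    """Equation (2): log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and predictor values (illustration only).
beta = [-1.0, 0.5, -0.25]
x = [2.0, 4.0]

# Linear predictor on the right-hand side of equation (3).
eta = beta[0] + beta[1] * x[0] + beta[2] * x[1]
p = pi_of_x(x, beta)

# Applying the logit to pi(x) should recover the linear predictor.
print(p, logit(p), eta)
```

Applying `logit` to the output of `pi_of_x` returns the linear predictor itself, which is what equation (3) states.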

2 Data

This study uses 2051 records obtained from a private laboratory in Surabaya, Indonesia. The respondents were chosen from people who work as company employees. The response variable is diabetes status (DM = 1 for diabetes and DM = 0 for non-diabetes). The predictor variables are sex (female = 1 and male = 0), age range (30 years old or younger = 1, 31-40 years old = 2, 41-50 years old = 3, and older than 50 years = 4), total cholesterol, LDL, triglyceride, and body mass index (BMI). The results of the binary logistic regression obtained with SPSS are given below:

Table 1. Number of cases in model

Case Processing Summary

Unweighted Cases(a)                         N      Percent
Selected Cases     Included in Analysis     2051   100.0
                   Missing Cases               0      .0
                   Total                    2051   100.0
Unselected Cases                               0      .0
Total                                       2051   100.0

a. If weight is in effect, see classification table for the total number of cases.

All 2051 cases were included in the analysis, with no missing cases (Table 1). Table 2 shows the dependent variable encoding, where the internal value of non DM (non-diabetes) is 0 and that of DM (diabetes) is 1. The Nagelkerke R Square shows that about 23.2% of the variation in the outcome variable (DM) is explained by this logistic model.

Table 2. Predicted outcome coding

Dependent Variable Encoding

Original Value    Internal Value
non DM            0
DM                1

Table 3. Categorical variables coding

Categorical Variables Codings

                                       Parameter coding
                         Frequency     (1)      (2)      (3)
Age range                   572        1.000    .000     .000
                            592        .000     1.000    .000
                            630        .000     .000     1.000
             50 years       257        .000     .000     .000
Sex          male          1258        1.000
             female         793        .000

Table 4. Amount of variation explained by the model

Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       998.098              .104                    .232

Table 5. Model discrimination

Classification Table(a)

                                        Predicted
                                        DM                  Percentage
Observed                                non DM    DM        Correct
Step 1    DM    non DM                  1865      5         99.7
                DM                       174      7          3.9
          Overall Percentage                                91.3

a. The cut value is .500

The Wald statistics in Table 6 indicate the importance of each variable's contribution to the model: the higher the value, the more important the variable. In terms of the predictor model, sex, age, triglyceride, and BMI are important risk factors for diabetes (DM), with p-values of 0.018, 0.000, 0.000, and 0.013 respectively, all below the 0.05 significance level. Because LDL and HDL do not contribute significantly to predicting diabetes, they are omitted from the model, even though no multicollinearity is evident.
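The Wald statistics and significance values reported in Table 6 can be approximated from the printed coefficients. The sketch below assumes the usual definition of the Wald statistic, (B / S.E.)^2, referred to a chi-square distribution with one degree of freedom; the function name `wald_test` is hypothetical. Because the inputs are the rounded B and S.E. values from Table 6, the results only approximate the SPSS output.

```python
import math

def wald_test(b, se):
    """Return the Wald statistic (b / se)^2 and its p-value on 1 degree of freedom."""
    w = (b / se) ** 2
    # For a chi-square variable with 1 df, P(X > w) = erfc(sqrt(w / 2)).
    p = math.erfc(math.sqrt(w / 2.0))
    return w, p

# Rounded B and S.E. values as printed in Table 6 (full model).
table6 = {
    "SEX(1)": (0.628, 0.265),
    "BMI": (0.054, 0.022),
    "LDL": (0.001, 0.002),
    "HDL": (-0.008, 0.011),
}

results = {name: wald_test(b, se) for name, (b, se) in table6.items()}
for name, (w, p) in results.items():
    print(f"{name}: Wald = {w:.3f}, p = {p:.3f}, significant at 0.05: {p < 0.05}")
```

With these rounded inputs, SEX(1) and BMI fall below the 0.05 level while LDL and HDL do not, which matches the pattern of significance in Table 6.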



Table 6. Estimates of the logistic regression model

Variables in the Equation

                                                                 95.0% C.I. for EXP(B)
Step 1(a)    B         S.E.    Wald      df    Sig.    Exp(B)    Lower    Upper
SEX(1)        .628     .265     5.645    1     .018    1.875     1.116    3.148
AGE                            78.465    3     .000
AGE(1)      -2.972     .443    44.932    1     .000     .051      .021     .122
AGE(2)      -1.809     .256    49.994    1     .000     .164      .099     .271
AGE(3)       -.715     .193    13.751    1     .000     .489      .335     .714
BMI           .054     .022     6.168    1     .013    1.056     1.012    1.102
LDL           .001     .002      .370    1     .543    1.001      .997    1.006
HDL          -.008     .011      .518    1     .472     .992      .972    1.013
TRIGL         .004     .001    25.166    1     .000    1.004     1.002    1.005
Constant    -3.625     .936    14.983    1     .000     .027

a. Variable(s) entered on step 1: SEX, AGE, BMI, LDL, HDL, TRIGL.

Table 7. Correlation matrix for Diabetes model

Correlation Matrix

Step 1      Constant   SEX(1)   AGE(1)   AGE(2)   AGE(3)   BMI      LDL      HDL      TRIGL
Constant    1.000      -.430    -.127    -.154    -.105    -.636    -.258    -.685    -.358
SEX(1)      -.430      1.000     .090     .093     .040     .024    -.102     .423    -.020
AGE(1)      -.127       .090    1.000     .215     .269    -.008     .106     .023     .041
AGE(2)      -.154       .093     .215    1.000     .463    -.044     .102     .057     .001
AGE(3)      -.105       .040     .269     .463    1.000    -.090     .055     .012     .003
BMI         -.636       .024    -.008    -.044    -.090    1.000    -.055     .113    -.041
LDL         -.258      -.102     .106     .102     .055    -.055    1.000    -.116     .076
HDL         -.685       .423     .023     .057     .012     .113    -.116    1.000     .403
TRIGL       -.358      -.020     .041     .001     .003    -.041     .076     .403    1.000

Table 8. Amount of variation explained by the model after HDL and LDL omitted

Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       998.903              .104                    .232

The new model (Table 10) contains no non-significant variables: all p-values (Sig.) are below 5%. Exp(B) gives the odds ratios. Since triglyceride is a quantitative variable, a one-unit increase in triglyceride corresponds to a 0.4% increase in the odds of having diabetes; this 0.4% is obtained by taking Exp(B) for triglyceride minus 1. Males have 2.051 times (95% CI 1.283 to 3.278) the odds of diabetes compared with females. For age range, 31-40 years old compared with 30 years old or younger has an odds ratio of 0.05 (95% CI 0.021 to 0.119); 41-50 years old compared with 31-40 years old and 30 years old or younger has an odds ratio of 0.163 (95% CI 0.099 to 0.268); and older than 50 years compared with the other age groups has an odds ratio of 0.487 (95% CI 0.334 to 0.710), in each case indicating lower odds of diabetes. A one-unit increase in body mass index (BMI) corresponds to a 5.8% increase in the odds of having diabetes. This result differs somewhat from general research, which shows that increasing age may increase the risk of developing diabetes, but other factors need to be considered, for instance diet, lifestyle, or genetic characteristics. On the other hand, the data in this study were taken from different individuals, which may yield a diabetes probability model that differs from the general theory.
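The odds ratios and confidence intervals of the reduced model (Table 10) can be reproduced approximately from the B and S.E. columns. This sketch assumes Wald-type intervals of the form exp(B ± 1.96 × S.E.), which is an assumption about how the interval was computed; the function name `odds_ratio_ci` is hypothetical. The inputs are the rounded values printed in Table 10, so the last digit may differ from the SPSS output.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval

def odds_ratio_ci(b, se, z=Z_95):
    """Return (Exp(B), lower, upper) for a Wald interval exp(b +/- z * se)."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

# Rounded B and S.E. values as printed in Table 10 (reduced model).
table10 = {
    "SEX(1)": (0.718, 0.239),
    "AGE(1)": (-2.991, 0.441),
    "AGE(2)": (-1.814, 0.254),
    "AGE(3)": (-0.719, 0.192),
    "BMI": (0.057, 0.022),
    "TRIGL": (0.004, 0.001),
}

for name, (b, se) in table10.items():
    or_, lo, hi = odds_ratio_ci(b, se)
    # Percent change in the odds for a one-unit increase is (Exp(B) - 1) * 100.
    print(f"{name}: Exp(B) = {or_:.3f}, 95% CI {lo:.3f} to {hi:.3f}, "
          f"change in odds = {(or_ - 1) * 100:+.1f}%")
```

Each AGE(k) coefficient is an odds ratio relative to whichever age category SPSS used as the reference (the row coded .000 in every column of Table 3), so the comparison group should be read from that table.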



Table 9. Model discrimination after LDL and HDL omitted

Classification Table(a)

                                        Predicted
                                        DM                  Percentage
Observed                                non DM    DM        Correct
Step 1    DM    non DM                  1864      6         99.7
                DM                       174      7          3.9
          Overall Percentage                                91.2

a. The cut value is .500

Table 10. Estimates of the logistic regression model after LDL and HDL omitted

Variables in the Equation

                                                                 95.0% C.I. for EXP(B)
Step 1(a)    B         S.E.    Wald      df    Sig.    Exp(B)    Lower    Upper
SEX(1)        .718     .239     9.015    1     .003    2.051     1.283    3.278
AGE                            80.973    3     .000
AGE(1)      -2.991     .441    46.093    1     .000     .050      .021     .119
AGE(2)      -1.814     .254    51.014    1     .000     .163      .099     .268
AGE(3)       -.719     .192    13.976    1     .000     .487      .334     .710
BMI           .057     .022     6.881    1     .009    1.058     1.014    1.104
TRIGL         .004     .001    33.504    1     .000    1.004     1.002    1.005
Constant    -3.925     .600    42.821    1     .000     .020

a. Variable(s) entered on step 1: SEX, AGE, BMI, TRIGL.

Table 11. Correlation matrix for Diabetes model after LDL and HDL omitted

Correlation Matrix

Step 1      Constant   SEX(1)   AGE(1)   AGE(2)   AGE(3)   BMI      TRIGL
Constant    1.000      -.271    -.119    -.125    -.124    -.894    -.063
SEX(1)      -.271      1.000     .096     .084     .042    -.031    -.226
AGE(1)      -.119       .096    1.000     .204     .265    -.003     .021
AGE(2)      -.125       .084     .204    1.000     .460    -.043    -.041
AGE(3)      -.124       .042     .265     .460    1.000    -.087    -.009
BMI         -.894      -.031    -.003    -.043    -.087    1.000    -.094
TRIGL       -.063      -.226     .021    -.041    -.009    -.094    1.000

The correlations (Table 11) among sex, age, triglyceride, and BMI are low, but the correlation between BMI and the constant is rather high (r = -0.894), which indicates some multicollinearity. Our recommendation is to keep the constant term in the model, as it acts as a "garbage bin" that collects all unexplained variance in the model (recall from Table 8 that the variables explain only 23.2% of the variation).

Table 12. Hosmer-Lemeshow test

Hosmer and Lemeshow Test

Step    Chi-square    df    Sig.
1       5.139         8     .743
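The significance value of the Hosmer and Lemeshow test can be checked from its chi-square statistic and degrees of freedom. The sketch below assumes the statistic is referred to a chi-square distribution with the stated df, and it uses the closed-form upper-tail probability that holds when the df is even; the function name `chi2_sf_even_df` is hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms

# Chi-square statistic and df as printed in the Hosmer and Lemeshow test table.
p_value = chi2_sf_even_df(5.139, 8)
print(round(p_value, 3))
```

A p-value well above 0.05 means the null hypothesis that the model fits is not rejected.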



The Hosmer-Lemeshow goodness-of-fit test tells us how closely the observed and predicted probabilities match. The null hypothesis is that the model fits, so a p-value > 0.05 is expected (Table 12). The overall accuracy of this model in predicting subjects having diabetes (with a predicted probability of 0.5 or greater) is 91.2% (Table 9). The sensitivity is 3.9% and the specificity is 99.7%. The positive predictive value (PPV) = 7/13 = 53.8% and the negative predictive value (NPV) = 1864/2038 = 91.5%.
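The accuracy, sensitivity, specificity, PPV and NPV all follow from the four cell counts of the classification table. The sketch below uses the counts of the model after LDL and HDL were omitted (1864, 6, 174 and 7); the function name `classification_metrics` and its argument names are hypothetical.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute standard 2x2 classification measures from the four cell counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # correctly predicted DM among observed DM
        "specificity": tn / (tn + fp),   # correctly predicted non DM among observed non DM
        "ppv": tp / (tp + fp),           # observed DM among predicted DM
        "npv": tn / (tn + fn),           # observed non DM among predicted non DM
    }

# Observed non DM: 1864 predicted non DM, 6 predicted DM.
# Observed DM:      174 predicted non DM, 7 predicted DM.
metrics = classification_metrics(tn=1864, fp=6, fn=174, tp=7)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f}%")
```

With these counts the ratio 7/13 evaluates to about 53.8% and 1864/2038 to about 91.5%.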

For example, a 41-year-old male with a triglyceride level of 167 and a BMI of 30.4 has Probability(diabetes) = 0.068; it is very unlikely that this subject has diabetes, and the NPV tells us that we are 91.5% confident. As another example, a 30-year-old male with a triglyceride level of 500 and a BMI of 33.7 has Probability(diabetes) = 0.68; it is very likely that this subject has diabetes, and the PPV gives 53.8% confidence.
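The two worked examples can be retraced by plugging the subject's values into the fitted logit model and inverting the logit. This is a sketch under stated assumptions: the coefficients are the rounded values of the reduced model, the function name `predict_dm` is hypothetical, and the indicator settings for sex and age are inferred so as to reproduce the first reported probability, since the text does not state them explicitly.

```python
import math

# Rounded coefficients of the reduced model (SEX, AGE, BMI, TRIGL).
COEF = {
    "const": -3.925,
    "sex(1)": 0.718,
    "age(1)": -2.991,
    "age(2)": -1.814,
    "age(3)": -0.719,
    "trigl": 0.004,
    "bmi": 0.057,
}

def predict_dm(sex1, age_dummies, trigl, bmi):
    """Probability of diabetes from the logit model.

    sex1        -- value of the SEX(1) indicator (0 or 1)
    age_dummies -- tuple (age1, age2, age3) of 0/1 indicators
    """
    a1, a2, a3 = age_dummies
    eta = (COEF["const"] + COEF["sex(1)"] * sex1
           + COEF["age(1)"] * a1 + COEF["age(2)"] * a2 + COEF["age(3)"] * a3
           + COEF["trigl"] * trigl + COEF["bmi"] * bmi)
    return 1.0 / (1.0 + math.exp(-eta))

# ASSUMPTION: indicator settings inferred to match the first worked example
# (male -> SEX(1) = 1; 41 years old -> AGE(2) = 1; 30 years old -> all AGE indicators 0).
p1 = predict_dm(sex1=1, age_dummies=(0, 1, 0), trigl=167, bmi=30.4)
p2 = predict_dm(sex1=1, age_dummies=(0, 0, 0), trigl=500, bmi=33.7)
print(round(p1, 3), round(p2, 3))
```

Under these assumptions the first subject's probability comes out at about 0.068. The second comes out at about 0.67 with the rounded coefficients, close to the reported 0.68.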

From the data analysis, we obtained the logit model:

$$\ln\frac{\pi(x)}{1-\pi(x)} = -3.925 + 0.718\,X_{sex(1)} - 2.991\,X_{age(1)} - 1.814\,X_{age(2)} - 0.719\,X_{age(3)} + 0.004\,X_{trigl} + 0.057\,X_{BMI}$$

This model shows that diabetes in employees depends significantly on sex, age, triglyceride level, and BMI (Body Mass Index, or IMT).

References

Agresti, A., 1990, Categorical Data Analysis, John Wiley and Sons, Inc., New York.

Al-khazrajy, LA., Raheem, YA. & Hanoon, YK., 2010, Sex Differences in the Impact of Body Mass Index (BMI) and Waist/Hip (W/H) Ratio on Patients with Metabolic Risk Factors in Baghdad. Global Journal of Health Science, Vol. 2, No. 2.

Brunham, LR., Kruit, JK., Verchere, CB., and Hayden, MR., 2008, Cholesterol in Islet Dysfunction and Type 2 Diabetes, The Journal of Clinical Investigation, Volume 118 Number 2.

Chan, YH., 2004, Biostatistics 202: Logistic regression analysis, Singapore Med Journal, Vol. 45(4) : 149.

Federal Bureau of Prisons, 2009, Management of Diabetes, Clinical Practice Guideline.

Friel, C.M., 1998, Probit/Logit Analysis, Criminal Justice Center, Sam Houston State University.

Hao, M., Head, WS., Gunawardana, SC., Hasty, AH., and Piston, DW., 2007, Direct Effect of Cholesterol on Insulin Secretion, A Novel Mechanism for Pancreatic β-Cell Dysfunction. Diabetes, Vol. 56.



Hosmer, DW. and Lemeshow, S., Applied Logistic Regression. [online] Available at: [Accessed 23 October 2010].

Laakso, M., Sarlund, H., Ehnholm, C., Voutilainen, E., Aro, A., and Pyörälä, K., 1987, Relationship between postheparin plasma lipases and high-density lipoprotein cholesterol in different types of diabetes. Diabetologia, 30:703-706.

Scheffer, PG., Teerlink, T., and Heine, RJ., 2005, Clinical significance of the physicochemical properties of LDL in type 2 diabetes, Diabetologia, 48:808-816.

Poedjiati, SA., 2010, Perbandingan Ketepatan Model Logit dan Probit Untuk Memprediksi Munculnya Penyakit Hipertensi pada Karyawan Perusahaan. Prosiding Seminar Nasional Basic Science VII, vol. 4:212.

Vasisht, A.K., 2000, Logit and Probit Analysis, I.A.S.R.I., Library Avenue, New Delhi.

Wild, S., Roglic, G., Sicree, R., Green, A., King, H., 2003, Global burden of diabetes mellitus in the year 2000. Global Burden of Disease, Geneva: WHO.

World Health Organization, 1999, Definition, Diagnosis and Classification of Diabetes Mellitus and its Complications. Report of a WHO Consultation. Geneva: World Health Organization.

World Health Organization, 2003, Screening for Type 2 Diabetes. Report of a World Health Organization and International Diabetes Federation meeting. Geneva: World Health Organization.

http://www.indiana.edu/~lceiub/PY206F05/Logistic.pdf
