A Novel Analysis and Predictive Risk Model of Chronic Diseases: A Markov Chain Monte Carlo Simulated Study on the Rising Incidence of Type II Diabetes Mellitus in the Youth By Meghna Narayan

Sigma xi presentation



Objective and Motivation

Objective: To study the longevity, mortality, morbidity and demographic risks of Type II Diabetes Mellitus (T2DM) among 15-20 year olds, using a compartmental statistical model based on Markov Chain Monte Carlo simulation, in order to understand the impact on the dependency ratio of the United States.

Motivation: Type II Diabetes is usually associated with older age; however, there is growing awareness of its increase among youth in the United States. This study builds a multi-state, Monte Carlo simulated compartmental model to understand the dynamics of chronic diseases such as Type II diabetes mellitus, and its increasing prevalence among the 15-20 year old age group in the United States, by scoring the risk factors associated with the disease. The study aims to provide a reasonable argument for reformulating the dependency ratio (the ratio of dependent people to the population of working age) to adjust for morbidity-related labor impairment. The approach modeled here is a cost-effective way to define the role of multiple risk factors as well as the temporal progression of the disease.

Methods and Expected Findings

Methods: Aggregate data from several studies and databases, such as Policy Map, CDC’s SEARCH, NHANES, ADA and NIH, were used to create a composite view of the prevalence and growth of T2DM in young adults in the U.S. Data pertaining to morbidity were extracted and used in a Markov Chain Monte Carlo simulation to fit a Bayesian model that estimates the covariates contributing to the increasing prevalence of T2DM in recent years, within the context of a three-state compartmental model.

Expected findings: Studies conducted across the U.S. and globally by different organizations document the increasing prevalence of T2DM among youth. An extensive literature review suggests that the rate of increase is most prominent in the 15-20 age group among minority racial/ethnic groups (i.e. African American, Hispanic and Native American sub-cohorts). The model will help identify the most prominent risk factor contributing to this increasing prevalence.

Anticipated Conclusion

Anticipated conclusion: The study will provide a correlation between T2DM and its underlying risk factors and project the impact of morbidity on the dependency ratio in the coming decades.

Hypothesis and Research Problem

Hypothesis:

A dependency ratio compares the number of people who are too young or too old to work with the number of people of working age. In economics, the dependency ratio is an age-population ratio of those typically not in the labor force (the dependent part) and those typically in the labor force (the productive part). It is used to measure the pressure on the productive population. [18]

Unfortunately, the current calculations for dependency ratios do not take into account the morbidity of the people eligible to work. Moreover, despite an aging population, those under the age of 16 will continue to constitute the largest "dependent" group in future years. Persons between the ages of 15 and 60 are termed “prime-age” adults because they are in their most productive working years. At this time, most can contribute to labor and income in rural households. When these “prime-age” adults become chronically ill, they change from productive members to members requiring care and medicines, no longer working in the fields or elsewhere. The effects of lost labor and skills due to death or illness can be even more profound.

Research Problem: This project focuses on finding a model to predict the trends of type 2 diabetes in youth, and on finding a method to reformulate the dependency ratio.
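As a rough illustration of the idea, the sketch below computes the standard age-based dependency ratio and then one hypothetical morbidity adjustment, in which a share of working-age people who are too ill to work is moved from the denominator to the numerator. The adjustment formula, the function names and all population figures are assumptions made for this example; the presentation does not specify how the ratio should be reformulated.

```python
def dependency_ratio(young, old, working_age):
    """Standard age-based dependency ratio: dependents per working-age person."""
    return (young + old) / working_age


def morbidity_adjusted_ratio(young, old, working_age, impaired_share):
    """Hypothetical adjustment: count working-age people who are too ill
    to work as dependents instead of as part of the productive population."""
    impaired = working_age * impaired_share
    return (young + old + impaired) / (working_age - impaired)


# Made-up population counts (in millions), for illustration only.
young, old, working_age = 60.0, 45.0, 200.0

baseline = dependency_ratio(young, old, working_age)
adjusted = morbidity_adjusted_ratio(young, old, working_age, impaired_share=0.05)
print(f"baseline: {baseline:.3f}, morbidity-adjusted: {adjusted:.3f}")
```

Under any adjustment of this kind, a positive impaired share raises the ratio, because the same people leave the denominator and enter the numerator.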

Background Information

In this project, studying the risk factors causing a high T2DM prevalence rate is very important because the cost burden, labor impairments, and overall effects of increased morbidity and mortality will have significant socioeconomic consequences on the U.S. population. Therefore, it is crucial to make accurate assessments and predictions about the distribution of T2DM in order to facilitate effective interventions. In addition, this paper focuses on identifying, studying and validating the various factors leading to T2DM and its long term effect on the US economy.

Diseases can be classified as communicable or non-communicable. Epidemiologists and statisticians work towards identifying diseases, their risk factors and distributions that shape subsequent interventions to inhibit the spread of both acute and chronic illnesses.

The lack of accurate data on the basis of which decisions and plans can be made calls for mathematical models and simulations that predict the pattern and frequency of an increasing trend of a non-communicable disease. Various disease trend models have been developed to study epidemic outbreaks in a population.

Mathematical models based on compartmental models of epidemiology help in predicting the transition from one state to another in homogeneous populations by means of stochastic or deterministic calculations. In order to model the spread of infectious diseases in a population, it is important to consider the non-homogeneity and spatial distribution of the population of a region, and to identify the risk groups for the disease among the various demographics along with the social behavior of those groups. For T2DM, risk groups include individuals who are overweight, have low levels of physical activity or poor eating habits, or have a family history of T2DM; people belonging to certain racial/ethnic groups; and people living in poor economic conditions.

Literature Review

An extensive literature review on the incidence and prevalence of Type II diabetes among 15-20 year olds can be used in traditional models such as logistic regression using predictive covariates and relative risks; however, the advent of Markov Chain Monte Carlo (MCMC) simulations allows for much greater accuracy in quantifying the future trends of Type II diabetes among this age group.

The TODAY trial on diabetes in youth identifies obesity as one of the primary causes of T2DM. The National Institutes of Health showed that obesity increased from 1971 to 2006 among 15-20 year olds, who have a higher chance of acquiring T2DM quickly. Childhood obesity has more than doubled in younger children and tripled in adolescents in the past 30 years.

The National Diabetes Statistics in 2011 stated that from 2002 to 2005, 3,600 youth were newly diagnosed with T2DM annually. The estimated costs for medical care were about $116 billion. T2DM has been gradually increasing since 1971 and can be seen as a major problem.

The National Diabetes Statistics in 2011 found that about 215,000 people younger than 20 years had some form of diabetes in the United States in 2010. [15] The study reported that children of certain ethnic backgrounds had a higher prevalence. On average, the prevalence rates were 87 percent higher for Mexican Americans, 94 percent higher for Puerto Ricans, 18 percent higher for Asian Americans, 66 percent higher for Hispanics/Latinos, and 77 percent higher for blacks.

Significance

Significance:

The goal of this study was to simulate a predictive model for understanding the disease dynamics by risk factor and exposure. The Monte Carlo method provides two major advantages: it can be used when raw data are not available, and it accounts for indirect effects due to mediators among the covariates.

Methods

A predictive model of risk progression for T2DM was developed using logistic regression and Bayesian inference by the Markov chain Monte Carlo method. One-year cycles were used for disease progression in this model. Primary end points for progression were transitions from the “no diabetes” state to the “pre-diabetes” and “diabetes” states. The three-state model partitions the sample dataset into “no diabetes”, “pre-diabetes” and diagnosed “diabetes” compartments. Independent covariates included:

Ethnicity

Gender

Obesity

Age

The model can be explored further for -

Sedentary behavior

Family history of T2DM

Consuming high density, low-nutrient food and drinks

Income level

BPA levels in urine
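The three-state structure with one-year cycles can be sketched as a small Monte Carlo simulation. The transition probabilities below are made-up values, and treating "diabetes" as absorbing and allowing a return from "pre-diabetes" to "no diabetes" are assumptions for this example; none of these are taken from the study, and the covariates are ignored here.

```python
import random

STATES = ("no diabetes", "pre-diabetes", "diabetes")

# Hypothetical annual transition probabilities (each row sums to 1).
# "diabetes" is treated as an absorbing state in this sketch.
TRANSITIONS = {
    "no diabetes": (0.95, 0.04, 0.01),
    "pre-diabetes": (0.10, 0.80, 0.10),
    "diabetes": (0.00, 0.00, 1.00),
}


def simulate_person(years, rng):
    """Follow one person through one-year cycles and return the final state."""
    state = "no diabetes"
    for _ in range(years):
        state = rng.choices(STATES, weights=TRANSITIONS[state])[0]
    return state


def simulate_cohort(n_people, years, seed=0):
    """Monte Carlo estimate of the share of the cohort in each state."""
    rng = random.Random(seed)
    counts = {s: 0 for s in STATES}
    for _ in range(n_people):
        counts[simulate_person(years, rng)] += 1
    return {s: counts[s] / n_people for s in STATES}


shares = simulate_cohort(n_people=10_000, years=5)
print(shares)
```

In a fuller model, each row of the transition table would depend on the covariates (ethnicity, gender, obesity, age) rather than being fixed.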

Cox Logistic Regression for Survival Analysis

The simplest multi-state model for a chronic disease such as T2DM is the two-state model for survival data, with one transient state ‘0: alive’ and one absorbing state ‘1: dead’. In general, an absorbing state is a state from which further transitions cannot occur, while a transient state is a state that is not absorbing. In its simplest form, the observation for a given individual consists of a random variable, say T, representing the time from a given origin (time 0) to the occurrence of the event ‘death’. The distribution of T may be characterized by the probability distribution function F(t) = Prob(T <= t) or, equivalently, by the survival distribution function S(t) = 1 - F(t) = Prob(T > t). S(t) and F(t) correspond, respectively, to the probabilities of being in state 0 or state 1 at time t. If every individual is assumed to be in state 0 at time 0, then F(t) is also the transition probability from state 0 to state 1 for the time interval from 0 to t.
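The relationship between F(t) and S(t) can be checked on a toy sample. The sketch below uses made-up, fully observed event times and ignores censoring, which is an assumption for illustration only; with censored observations, as in the study data, an estimator such as Kaplan-Meier would be needed instead.

```python
def survival_function(event_times, t):
    """Empirical S(t) = Prob(T > t) for fully observed event times."""
    return sum(1 for x in event_times if x > t) / len(event_times)


def distribution_function(event_times, t):
    """Empirical F(t) = Prob(T <= t), equal to 1 - S(t)."""
    return 1.0 - survival_function(event_times, t)


# Made-up times (in years) from state 0 to state 1 for five individuals.
times = [2.0, 3.5, 3.5, 6.0, 9.0]

for t in (0, 3.5, 10):
    print(t, survival_function(times, t), distribution_function(times, t))
```

At t = 0 everyone is still in state 0, so S(0) = 1 and F(0) = 0; once t exceeds the largest event time, everyone has moved to state 1.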

Bayesian Inference

Bayes' theorem provides a mathematical method that can be used to calculate, given occurrences in prior trials, the likelihood of a target occurrence in future trials. According to Bayesian logic, the only way to quantify a situation with an uncertain outcome is through determining its probability. Let y be a set of covariate observations (scalar or vector) at discrete time points, and let θ represent a vector of free parameters. The goal is to find the set of parameters that best fits the data and to evaluate how good the model is. Bayesian statistical conclusions about a parameter θ, or unobserved data ŷ, are made in terms of probability statements. These probability statements are conditional on the observed value of y and are written simply as p(θ|y) or p(ŷ|y). The best way to reach this goal is to use Bayesian inference and model comparison, which can be computed using Markov Chain Monte Carlo (MCMC). However, MCMC can also be used simply to obtain the parameters that give the best fit according to some criterion.
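For a single parameter, the posterior p(θ|y) can be computed directly on a grid, which makes the role of the prior and the likelihood concrete. The sketch below is a toy example, not the model used in this study: θ is assumed to be a single proportion, the prior is flat, and the counts are made up.

```python
import math


def grid_posterior(successes, trials, grid_size=1001):
    """Posterior for a proportion theta on a grid, with a flat prior.

    p(theta | y) is proportional to p(y | theta) * p(theta).
    """
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # flat (non-informative) prior
    likelihood = [
        math.comb(trials, successes) * th**successes * (1 - th) ** (trials - successes)
        for th in thetas
    ]
    unnormalized = [lik * p for lik, p in zip(likelihood, prior)]
    total = sum(unnormalized)
    return thetas, [u / total for u in unnormalized]


# Made-up data: 12 cases observed among 100 people.
thetas, posterior = grid_posterior(successes=12, trials=100)
posterior_mean = sum(t * p for t, p in zip(thetas, posterior))
print(f"posterior mean of theta: {posterior_mean:.4f}")
```

A grid like this becomes impractical as the number of parameters grows, which is the situation where simulation methods such as MCMC are typically used.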

Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC) simulation is a general method based on drawing values of θ from approximate distributions and then correcting those draws to better approximate the target posterior distribution, p(θ|y). The samples are drawn sequentially, with the distribution of each draw depending on the last value drawn; hence the draws form a Markov chain. The key to the success of the method is that the approximate distributions are improved at each step in the simulation, in the sense of converging to the target distribution. In MCMC, several independent sequences of simulation draws are created; each sequence, θ_t, t = 1, 2, 3, …, is produced by starting at some point θ_0 and then, for each t, drawing θ_t from a transition distribution, T_t(θ_t | θ_{t-1}), that depends on the previous draw, θ_{t-1}. It is often convenient to allow the transition distribution to depend on the iteration number t; hence the notation T_t. The transition probability distributions must be constructed so that the Markov chain converges to a unique stationary distribution that is the posterior distribution, p(θ|y). MCMC is used when it is not feasible to sample θ directly from p(θ|y). The samples are taken iteratively in such a way that, at each step of the process, the draw can be expected to come from a distribution that becomes closer and closer to p(θ|y). The key to MCMC is to create a Markov process whose stationary distribution is the specified p(θ|y) and to run the simulation long enough that the distribution of the current draws is close enough to this stationary distribution. In my model, the adaptive rejection Metropolis sampling (ARMS) algorithm is used to draw the Gibbs samples.
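The chain construction described above can be illustrated with a random-walk Metropolis sampler, which is a simpler algorithm than the ARMS-within-Gibbs sampler used in the study and is swapped in here only for readability. The target is a toy posterior for a single proportion with a flat prior and made-up counts; the burn-in and sample sizes mirror those listed in the model information (2000 and 20000).

```python
import math
import random


def log_posterior(theta, successes, trials):
    """Log of likelihood times a flat prior on (0, 1), up to a constant."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return successes * math.log(theta) + (trials - successes) * math.log(1.0 - theta)


def metropolis(successes, trials, n_draws, burn_in, step=0.05, seed=1):
    """Random-walk Metropolis: each draw depends only on the previous one."""
    rng = random.Random(seed)
    theta = 0.5  # starting point theta_0
    current = log_posterior(theta, successes, trials)
    draws = []
    for t in range(burn_in + n_draws):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, successes, trials)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if t >= burn_in:
            draws.append(theta)  # keep draws only after the burn-in period
    return draws


# Made-up data: 12 cases among 100 people.
draws = metropolis(successes=12, trials=100, n_draws=20000, burn_in=2000)
print(f"posterior mean: {sum(draws) / len(draws):.3f}")
```

Discarding the burn-in draws reflects the point made above: early draws come from a distribution that is still far from the stationary one.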

Results

Table 1:

Table 2:

Model Information

Data Set              WORK.DIABETES
Dependent Variable    Time (Survival Time)
Censoring Variable    T2DMStatus (0=No Diabetes 1=Pre-diabetes 2=Diabetes)
Censoring Value(s)    0 1
Model                 Cox
Ties Handling         DISCRETE
Sampling Algorithm    ARMS
Burn-In Size          2000
MC Sample Size        20000
Thinning              1

Table 3:

Table 5: Summary of the Number of Event and Censored Values

Total    Event    Censored    Percent Censored
326      117      209         64.11

Table 4:

Table 6:

Table 7: Bayesian Analysis from the program

Table 8:

Number of Observations Read    326
Number of Observations Used    326

Maximum Likelihood Estimates

Parameter    DF    Estimate    Standard Error    95% Confidence Limits
BMI          1      0.0515     0.0183             0.0157    0.0873
Sex          1     -0.4075     0.2353            -0.8687    0.0536
Race         1      0.3690     0.0741             0.2237    0.5143
Age          1      0.0922     0.0575            -0.0206    0.2049

Uniform Prior for Regression Coefficients

Parameter    Prior
BMI          Constant
Sex          Constant
Race         Constant
Age          Constant

Initial Values of the Chain

Chain    Seed    BMI       Sex        Race      Age
1        1       0.0515    -0.4075    0.3690    0.0922

Posterior Summaries

Parameter    N        Mean       Standard Deviation    25%        50%        75%
BMI          20000     0.0524    0.0184                 0.0400     0.0522     0.0644
Sex          20000    -0.4122    0.2360                -0.5700    -0.4115    -0.2511
Race         20000     0.3685    0.0744                 0.3196     0.3694     0.4191
Age          20000     0.0942    0.0575                 0.0550     0.0939     0.1326

Posterior Intervals

Parameter    Alpha    Equal-Tail Interval      HPD Interval
BMI          0.050     0.0168    0.0893         0.0167    0.0891
Sex          0.050    -0.8832    0.0430        -0.8732    0.0514
Race         0.050     0.2197    0.5118         0.2260    0.5177
Age          0.050    -0.0163    0.2078        -0.0170    0.2068

Posterior Correlation Matrix

Parameter    BMI       Sex       Race      Age
BMI          1.0000    0.2060    0.1808    0.0396

Results


Regression Models Selected by Score Criterion

Number of Variables    Score Chi-Square    Variables Included in Model
1                      25.5541             Race
1                      13.9674             Age
1                      7.2832              BMI
2                      39.4908             Sex Race
2                      37.9123             BMI Race
2                      32.2592             Race Age
3                      44.6657             BMI Sex Race
3                      43.6418             BMI Race Age
3                      41.3759             Sex Race Age
4                      47.1469             BMI Sex Race Age

Discussion

The Cox logistic regression yielded the maximum likelihood estimates for the four independent variables. The results for Age and Sex were disregarded because their confidence limits included 0. Of the remaining variables, Race was the strongest predictor (MLE 0.3690, SE 0.0741, 95% CL 0.2237-0.5143), followed by BMI (MLE 0.0515, SE 0.0183, 95% CL 0.0157-0.0873). Correlation between the independent variables is shown in the Bayesian estimation figures. The two covariates that show the highest correlation are Age and Sex (0.3652), followed by BMI and Sex (0.2171) and Race and BMI (0.1738). Comparing these inter-variable correlations with the variables' associations with the outcome would be valuable in determining the presence of confounding. The posterior summary generated by the Bayesian MCMC also shows that Race was the dominant covariate in determining the outcome variable. This may be due to cultural factors (socioeconomic status, diet, etc.) or genetic predispositions and would be valuable to explore further in subsequent analysis.
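The screening rule used in the discussion (disregard a covariate when its 95% confidence limits include 0) can be reproduced from the reported estimates and standard errors. The sketch below assumes the limits are ordinary Wald intervals (estimate ± 1.96 × SE), which matches the reported limits to within rounding; the exponentiated estimate is shown as the usual multiplicative reading of a Cox-type coefficient per unit increase, and since the coding of Race and Sex is not given, the direction of those effects is not interpreted here.

```python
import math

# Maximum likelihood estimates and standard errors as reported in the results.
ESTIMATES = {
    "BMI": (0.0515, 0.0183),
    "Sex": (-0.4075, 0.2353),
    "Race": (0.3690, 0.0741),
    "Age": (0.0922, 0.0575),
}

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def wald_interval(estimate, std_error, z=Z_95):
    """Approximate 95% Wald confidence limits: estimate +/- z * SE."""
    return estimate - z * std_error, estimate + z * std_error


summary = {}
for name, (est, se) in ESTIMATES.items():
    low, high = wald_interval(est, se)
    summary[name] = {
        "limits": (low, high),
        "excludes_zero": low > 0 or high < 0,
        "exp_estimate": math.exp(est),
    }
    print(name, round(low, 4), round(high, 4),
          summary[name]["excludes_zero"], round(math.exp(est), 3))
```

Only BMI and Race have intervals that exclude 0 under this rule, which is consistent with the two covariates retained in the discussion.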

Conclusions

The alarming incidence of Type II Diabetes (T2DM) in both children and young adults requires immediate intervention to minimize morbidity among those who have been diagnosed and to prevent future cases from occurring in this demographic. If the current trend continues, there will be a significantly greater prevalence of cardiovascular disease, peripheral neuropathy, infection and, ultimately, disability, which will impose a tremendous burden on the U.S. healthcare system. The dependency ratio helps to describe the proportion of the population that is economically dependent. Although we do not have longitudinal data on the morbidity and mortality of T2DM among children and young adults, data from studies conducted on the adult population suggest that the dependency ratio will increase as a result.

Conclusions (cont'd)

In order to address this issue, it is important to provide robust predictive models to the important stakeholders from which we can acquire resources, and to ensure the most effective allocation of those resources. The results of our analysis identified Race/Ethnicity as the most significant factor affecting the outcome of T2DM. Race may represent genetic predisposition, environmental factors, or both. If we assume that our results are externally valid, they suggest that the most effective interventions would be targeted towards high-risk groups based on race/ethnicity. Further studies should be carried out to help identify contributors to this risk category. Valid raw data analyzed with this method will yield valuable evidence that can better define the primary determinants of disease specific to this population. Additionally, the results of this method decrease the margin of error and measures of variation that characterize traditional predictive modeling. This research has shown that, using Bayesian methods of inference with a Markov Chain Monte Carlo simulation, chronic disease progress and its associated risk factors can be studied as follows:

1. Estimate missing data: The Bayesian models were able to estimate prevalence rates for diseases and risk factors with limited data input, allowing estimates to be made even for lesser-studied diseases and risk factors.

2. Incorporate any available prior information: Where there was no prior information, non-informative or flat priors could still be used.

3. Include additional predictors: The models are open, so adding new predictors is easy.

References

[1] Harris MI: Prevalence of noninsulin-dependent diabetes and impaired glucose tolerance. Chapter VI In: Diabetes in America, Harris MI, Hamman RF, eds. NIH publ. no. 85-1468, 1985
[2] Dept of Health and Human Services, C. f. (2011). http://www.cdc.gov/diabetes/pubs/pdf/search.pdf.
[3] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2762509/
[4] National Health Interview Survey (NHIS, available at http://www.cdc.gov/nchs/nhis.htm) of the National Center for Health Statistics (NCHS)
[5] Bloom, D. C.-L.-G. (2011). The Global Economic Burden of Noncommunicable Diseases. Geneva: World Economic Forum.
[6] Search for Diabetes in Youth http://www.cdc.gov/diabetes/pubs/pdf/search.pdf
[7] Presented at the 72nd American Diabetes Scientific Sessions, June 9, 2012, ADA: Both Type 1 and T2DM Rates Increase Significantly among American Youth
[8] Constantino, Maria I. "Long-Term Complications and Mortality in Young-Onset Diabetes: Type 2 Diabetes Is More Hazardous and Lethal than Type 1 Diabetes." Diabetes Care. Diabetes Centre, 11 July 2013. Web. <http://www.ncbi.nlm.nih.gov/pubmed/?term=Long-Term+Complications+and+Mortality+in+Young-Onset+Diabetes>
[9] SEARCH for Diabetes in Youth http://www.cdc.gov/diabetes/pubs/pdf/search.pdf
[10] Obesity and T2DM in children and youth (Kaufman, 2006)
[11] T2DM in youth: rates, antecedents, treatment, problems and prevention, Editorial Pediatric Diabetes 2007, Sept 9 (4-6)
[12] TODAY trial TREATMENT PROTOCOL https://portal.bsc.gwu.edu/documents/11448/d69340ae-3443-4bd6-81f4-1d8d5f2ae369
[13] Ogden CL, Carroll MD, Kit BK, Flegal KM. Prevalence of obesity and trends in body mass index among US children and adolescents, 1999-2010. Journal of the American Medical Association 2012;307(5):483-490
[14] National Center for Health Statistics. Health, United States, 2011: With Special Features on Socioeconomic Status and Health. Hyattsville, MD; U.S. Department of Health and Human Services; 2012.
[15] National Institutes of Health, National Heart, Lung, and Blood Institute. Disease and Conditions Index: What Are Overweight and Obesity? Bethesda, MD: National Institutes of Health; 2010.
[16] Krebs NF, Himes JH, Jacobson D, Nicklas TA, Guilday P, Styne D. Assessment of child and adolescent overweight and obesity. Pediatrics 2007;120:S193–S228
[17] Daniels SR, Arnett DK, Eckel RH, et al. Overweight in children and adolescents: pathophysiology, consequences, prevention, and treatment. Circulation 2005;111;1999–2002.
[18] William H. Crown, Some Thoughts on Reformulating the Dependency Ratio, Gerontological Society of America, 1995

Acknowledgments

I would like to acknowledge Dr. Vladimir Shapovalov for his work in guiding me as his research student and Lata Ganesh from the World Bank Organization for supervising my research and providing me with access to texts, information and consultation to finalize this work, as well as David Mordecai, Samantha Kappagoda and Daniel Stein from New York University for assisting me with the mathematical concepts.