Modelling Latent Variables for Bayesian Networks

University of Minnesota Morris Digital Well

Undergraduate Research Symposium 2016

4-2016

Charles Cain, University of Minnesota - Morris, [email protected]

Follow this and additional works at: https://digitalcommons.morris.umn.edu/urs_2016

Part of the Statistics and Probability Commons

Recommended Citation

Cain, Charles, "Modelling Latent Variables for Bayesian Networks" (2016). Undergraduate Research Symposium 2016. 2. https://digitalcommons.morris.umn.edu/urs_2016/2

This Book is brought to you for free and open access by the Undergraduate Research Symposium at University of Minnesota Morris Digital Well. It has been accepted for inclusion in Undergraduate Research Symposium 2016 by an authorized administrator of University of Minnesota Morris Digital Well. For more information, please contact [email protected].

Modeling Latent Variables for Bayesian Networks
Charles H. Cain
University of Minnesota Morris, Division of Science and Mathematics

Department of Statistics
University of Minnesota Morris
600 E 4th Street
Morris, MN 56267

Phone: 651-395-1209
Email: [email protected]

Abstract

Causal Inference has been a large area of study in Biostatistics and Epidemiology to help explain what specifically causes certain diseases and conditions. Bayesian Networks use conditional probabilities in order to make these causal inferences. The simplest types of Bayesian Networks are networks made up of observable discrete or discretized continuous variables. However, many authors have shown advantages of continuous variables in Bayesian Networks over their discretization. Also, latent (hidden) variables have been used as factors in explaining variables inside the Bayesian Network, whether discrete or continuous. This research looks at latent variables as a weighted sum of observed variables, much like in Path Analysis or Structural Equations. We then use these modeled latent variables as continuous variables in a Bayesian Network. As an example, we look at a Bayesian Network of the causation of Diabetes using data from NHANES, modeling a latent variable, "Lack of Physical Activity", as a weighted sum of variables in the data.

Introduction

Bayesian Networks have been a large focus in statistics and in causal analysis in particular. They utilize conditional probabilities in order to make causal inferences. One concern with Bayesian Networks is latent variables. Various research has been done on the best way to learn them; see Anandkumar et al. [1] and Bishop et al. [2]. The method proposed here uses a weighted sum of observed variables from data to model a latent variable, in the same way that latent variables are modeled in Path Analysis by Sanchez [3].

To demonstrate how these methods for latent variable modeling work, a model involving Type II Diabetes, shown in Figure 3, is considered. According to the National Institute of Diabetes and Digestive and Kidney Diseases, Type II Diabetes is caused by High Age, Black or Hispanic Race, having a relative with Diabetes, obesity/overweight, Metabolic Syndrome, and Physical Inactivity. All of these variables are observable and discrete except for Physical Inactivity, which is the latent variable. Thus, Physical Inactivity is the main focus in the example.

Latent Variable Modeling

Latent variable modeling with a weighted sum results in dimension reduction, since observed variables are combined into one. This is demonstrated in Figure 1, with depression modeled by observable variables.

Figure 1: A latent variable (Depression) modeled by observable variables

Path Analysis models latent variables as a weighted sum of observed variables, where the weights are found through an iterative process involving regression and correlation outlined by Sanchez [3]. By using this weighted sum, a latent variable score can be obtained.
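The weighted sum described above can be written in symbols. The notation is an assumption introduced here for illustration and does not come from the poster: x_{ij} is the value of observed variable j for individual i, w_j is the weight attached to that variable, p is the number of observed variables tied to the latent variable, and \hat{\xi}_i is the resulting latent variable score.

```latex
\[
  \hat{\xi}_i \;=\; \sum_{j=1}^{p} w_j \, x_{ij}
\]
```

Because the p observed variables collapse into the single score \hat{\xi}_i, this is also where the dimension reduction comes from.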

To use the Path Analysis Method in a Bayesian Network, first use a predefined graph to identify the latent variables in the model as well as the observed variables to be associated with each latent variable. Then, run a Path Analysis with the defined graph to obtain the weights. Using those weights, form the latent variable scores as continuous variables, unless the observed variables forming a latent variable are all discrete. Finally, with all of the latent variables defined, learn the Bayesian Network with all variables, latent and observed.

Figure 2: An example of a Path Analysis graph drawn in Lisrel Student

Figure 3: The Directed Acyclic Graph from the Bayesian Network used to model Type II Diabetes

Figure 4: Physical Inactivity modeled by its observed variables

Example with Type II Diabetes

Using the 2011-2012 NHANES data, a Bayesian Network involving Type II Diabetes is considered. The graph of the Bayesian Network is shown in Figure 3, based on information from the Mayo Clinic [5] and the National Institute of Diabetes and Digestive and Kidney Diseases [4]. Race, Age, Metabolic Syndrome, Obesity, Physical Inactivity, and having a relative with Diabetes are all thought to have an effect on having Type II Diabetes. Being African American or Hispanic is thought to increase the risk for Type II Diabetes, as is being above the age of 45.

Since Physical Inactivity is latent, it is modeled with observable variables from the dataset, shown in Figure 4. Using the Path Analysis Method, Physical Inactivity is represented by equation 1, so a higher score means more physically inactive.

\[
\text{Physical Inactivity} = -0.156(\text{Sedentary}) + 0.844(\text{TV}) - 0.033(\text{Moderate}) - 0.526(\text{Vigorous}) \tag{1}
\]
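As a minimal sketch of how equation 1 turns observed values into a score, the snippet below applies the four weights to one made-up individual. The weights are copied from equation 1; the dictionary keys, the function name, and the input values are hypothetical, and treating the inputs as standardized values is an assumption, since the poster does not state the scale of the observed variables.

```python
# Weights copied from equation 1; the keys are shortened, hypothetical labels.
WEIGHTS = {
    "sedentary": -0.156,
    "tv": 0.844,
    "moderate": -0.033,
    "vigorous": -0.526,
}


def physical_inactivity_score(observed):
    """Return the weighted sum of the observed variables, as in equation 1."""
    return sum(WEIGHTS[name] * observed[name] for name in WEIGHTS)


# Hypothetical inputs (assumed standardized); these are not NHANES records.
person = {"sedentary": 0.5, "tv": 1.0, "moderate": -0.5, "vigorous": -1.0}
print(round(physical_inactivity_score(person), 4))
```

More TV time and less vigorous activity both push this score upward, which matches the reading that a higher score means more physically inactive.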

A continuous Physical Inactivity score results in Type II Diabetes and Metabolic Syndrome having logistic probability distributions, given in equations 2 and 3 respectively, for someone with traits "X" (Metabolic Syndrome, Obesity, a relative with Diabetes, but not African American or Hispanic).

Table 1 shows the probability of having Type II Diabetes given different levels of Physical Inactivity, modeled by the proposed Path Analysis Method and by the first Principal Component. With the Path Analysis Method, higher Physical Inactivity results in a higher probability of having Type II Diabetes, whereas the first principal component has a negligible effect on the probability of Type II Diabetes.

\[
P(\text{Diabetes} \mid \text{Physical Inactivity}, X) = \frac{e^{-0.581 + 0.136(\text{Physical Inactivity})}}{1 + e^{-0.581 + 0.136(\text{Physical Inactivity})}} \tag{2}
\]

Method                 P(Diabetes | PI = 90%, X)   P(Diabetes | PI = 10%, X)
Principal Component    0.381                       0.362
Path Analysis          0.404                       0.323

Table 1: Probabilities of having Type II Diabetes
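For readers unfamiliar with the baseline in this comparison, here is a rough sketch of how a first principal component score can be formed. It uses plain power iteration on the sample covariance matrix; the function name and the data are hypothetical, and this is not the software used for the poster. Unlike Path Analysis weights, these weights depend only on the observed variables themselves and ignore every other node in the network.

```python
import math


def first_pc_scores(rows, iterations=500):
    """Return (weights, scores) for the first principal component of rows."""
    n, p = len(rows), len(rows[0])
    # Center each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the centered columns.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each score is the weighted sum of one centered row.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


# Hypothetical two-variable data, for illustration only.
example_rows = [[2.0, 1.0], [3.0, 2.5], [4.0, 2.0], [6.0, 4.5]]
weights, scores = first_pc_scores(example_rows)
print([round(w, 3) for w in weights])
```

The sketch assumes the starting vector is not orthogonal to the leading eigenvector and that the data are not constant; a production analysis would more likely call an established linear algebra routine.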

Table 2 shows the probability of having Metabolic Syndrome given different levels of Physical Inactivity, modeled by the proposed Path Analysis Method and by the first Principal Component. With the Path Analysis Method, higher Physical Inactivity results in a higher probability of having Metabolic Syndrome, whereas the first principal component has a negligible effect on the probability of Metabolic Syndrome.

\[
P(\text{Metabolic} \mid \text{Physical Inactivity}, X) = \frac{e^{-1.787 + 0.296(\text{Physical Inactivity})}}{1 + e^{-1.787 + 0.296(\text{Physical Inactivity})}} \tag{3}
\]

Method                 P(Metabolic | PI = 90%, X)   P(Metabolic | PI = 10%, X)
Principal Component    0.464                        0.454
Path Analysis          0.516                        0.398

Table 2: Probabilities of having Metabolic Syndrome
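The logistic forms in equations 2 and 3 can be evaluated directly. The sketch below uses the coefficients from those two equations; the function names and the score values are hypothetical, and the snippet does not attempt to reproduce Tables 1 or 2, because the poster does not report the Physical Inactivity scores behind the 90% and 10% levels.

```python
import math


def logistic(intercept, slope, score):
    """Evaluate e^z / (1 + e^z), where z = intercept + slope * score."""
    z = intercept + slope * score
    # 1 / (1 + e^-z) is algebraically the same expression.
    return 1.0 / (1.0 + math.exp(-z))


def p_diabetes(score):
    """Equation 2: P(Diabetes | Physical Inactivity, X)."""
    return logistic(-0.581, 0.136, score)


def p_metabolic(score):
    """Equation 3: P(Metabolic | Physical Inactivity, X)."""
    return logistic(-1.787, 0.296, score)


# Hypothetical Physical Inactivity scores, chosen only for illustration.
for score in (-1.0, 0.0, 1.0):
    print(score, round(p_diabetes(score), 3), round(p_metabolic(score), 3))
```

Both slopes are positive, so each probability rises as the Physical Inactivity score rises, which is the direction reported for the Path Analysis rows of the tables.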

Conclusion

Using the weights found through an iterative process in Path Analysis, we can obtain a score to represent a named latent variable to be used in a Bayesian Network. This method allows us to consider the other variables in the Bayesian Network while also utilizing the observed variables, resulting in dimension reduction.

In a Bayesian Network used for explaining Type II Diabetes, the latent variable for Physical Inactivity can be modeled by the Path Analysis Method. A higher Physical Inactivity score results in a higher probability of having Type II Diabetes, as we would expect based on information from the Mayo Clinic. Compared to modeling Physical Inactivity with the first principal component, modeling the latent variable using the Path Analysis Method gave a conditional probability for diabetes with a larger effect from Physical Inactivity.

Further Research

We saw an example of the proposed method working with one latent variable. However, more research needs to be done to confirm that the method would work for multiple latent variables in a Bayesian Network. Also, research on representing latent variables in a larger, more complicated model needs to be done. Bayesian Networks allow for backwards inference, so a more complicated model would demonstrate latent variables' usefulness for the different types of inference in Bayesian Networks. Finally, being able to apply this method to a model with unknown structure would be crucial in order to fully integrate the Path Analysis Method into Bayesian Networks.

References

[1] Anandkumar, Animashree, Daniel Hsu, Adel Javanmard, and Sham M. Kakade. "Learning Linear Bayesian Networks with Latent Variables." JMLR 28 (2013): n. pag. Columbia.edu. Columbia University, 2013. Web. 15 July 2015.

[2] Bishop, Christopher M., Nevena Lazic, and John Winn. "Structural Expectation Propagation (SEP): Bayesian Structure Learning for Networks with Latent Variables." JMLR: W&CP 31 (2013): n. pag. Jmlr.org. Journal of Machine Learning Research, 2013. Web. 15 July 2015.

[3] Sanchez, G. PLS Path Modeling with R. Trowchez Editions: Berkeley (2013). Web. 14 May 2015.

[4] "Causes of Diabetes." National Diabetes Information Clearinghouse. National Institute of Health, n.d. Web. 09 June 2015.

[5] "Metabolic Syndrome." Mayoclinic.org. Mayo Clinic, 22 Aug. 2014. Web. 16 July 2015.

Acknowledgements

• This project was supported in part by a grant to the University of Minnesota, Morris from the Howard Hughes Medical Institute through the Precollege and Undergraduate Science Education Program.

• Thank you to Professor Engin Sungur for advising and guiding my research.
