UNIVERSITY OF ZIMBABWE

COLLEGE OF HEALTH SCIENCES

DEPARTMENT OF COMMUNITY MEDICINE

LOGISTIC REGRESSION AND LINEAR DISCRIMINANT ANALYSIS

IN THE EVALUATION OF FACTORS ASSOCIATED WITH

STUNTING IN CHILDREN: DIVERGENCE AND SIMILARITY OF

THE STATISTICAL METHODS.

Rutunga L. R944608E

Supervisors: Professor S. Rusakaniko

Mr V Chikwasha

A dissertation submitted in partial fulfillment of the Master of Science

Degree in Biostatistics.

DECLARATION FORM

STUDENT:

I do hereby declare that this dissertation is the original work of LOYCE RUTUNGA and has

not been submitted before to the University of Zimbabwe or any other institution for the

fulfilment of any degree requirements.

Name …………………………………………………………………………………………..

Signature ............................................................................................Date ................................

SUPERVISOR:

I certify that I have supervised the writing of this dissertation and declare that it is indeed the

original work of the student in whose name it is being submitted.

Name …………………………………………………………………………………………..

Signature ............................................................................................Date ................................

DEPARTMENTAL CHAIRPERSON:

I do hereby declare all of the above statements to be true.

Name …………………………………………………………………………………………..

Signature ............................................................................................Date ................................

ABSTRACT

Background: Stunting is a well-established child health indicator of chronic malnutrition

which is associated with biological, environmental and socioeconomic factors. Logistic

regression and linear discriminant analysis are two statistical methods that can be used to

predict or classify subjects as either stunted or not stunted based on all or a subset of

measured predictor variables. The predictive accuracy of the two methods was compared

with respect to several attributes of each of the methods.

Methods: Data used for the study was extracted from the Zvitambo trial data set. The

multivariable logistic regression and linear discriminant models were fitted using 20

bootstrap samples for cross validation of the coefficients. The two models were compared

with respect to the variables selected, the sign and magnitude of the coefficients, sensitivity,

specificity, overall classification rate and areas under ROC curves. The two methods were

applied in combination to check if predictive accuracy would improve.

Results: Logistic regression and linear discriminant analysis had similar predictive

accuracy, with classification rates of 78.76% and 78.86% respectively. Both methods

identified two common factors, sex and birth weight, and the coefficients of the two factors

had the same negative sign but differed significantly in magnitude. Both had low sensitivity

(13.19% and 8.68%) and high specificity (97.44% and 98.24%). Combining the two methods

did not improve predictive accuracy (71.5% before and 70.24% after).

Conclusion: The two multivariable techniques tend to converge in classification accuracy

mainly when the sample size is large (>50). When a choice must be made between the

two, it is recommended to use the method whose assumptions for application are fulfilled.

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my supervisors, Professor S. Rusakaniko and

Mr V. Chikwasha for their valuable academic guidance throughout this project and also to Mr

W Tinago and Mr G Mandozana for their occasional contribution. My acknowledgement

would be incomplete without extending my heartfelt appreciation to Zvitambo Institute for

Maternal and Child Health Research for allowing me to use their data for this research, with

special gratitude to Mr R Ntozini, Mr B Chasekwa and Dr M Mbuya who took a special

interest and commitment to see me through this project.

Last but not least, a special thank you to my fellow classmates for helping me to remain

focussed on the reason we started this academic journey and my sincerest gratitude goes to

my boys for soldiering on for so long without me.

Table of Contents

DECLARATION FORM
ABSTRACT
ACKNOWLEDGEMENTS
CHAPTER ONE: INTRODUCTION
1.0 Stunting
1.1 Multivariable Statistical Techniques
1.2 Description of the Original Study
1.2.1 Research Primary Objectives
1.2.2 Specific Objectives of the Zvitambo Study
1.2.3 Subjects, Materials and Methods of the Zvitambo Study
1.2.4 Data Management
1.2.5 Data Analysis
1.3 Critical Appraisal of the Study
1.3.1 Research Primary Objectives
1.3.2 Specific Objectives of the Study
1.3.3 Study Design
1.3.4 Sample Size
1.3.5 Sampling Methods
1.3.6 Data Collection Methods
1.3.7 Data Analysis
1.4 Quality of Data
1.5 Problem Statement
CHAPTER TWO: LITERATURE REVIEW
2.1 Stunting
2.2 Logistic Regression and Discriminant Analysis
2.3 Research Questions
2.4 Justification of the Study
2.5 Research Objectives
CHAPTER THREE: METHODOLOGY
3.1 Description of Data
3.2 Sample Size
3.3 Secondary Data Analysis Variables
3.4 Data Management
3.5 Statistical Analysis
3.6 Ethical Considerations
CHAPTER FOUR: RESULTS
4.1 Demographic Characteristics of the Participants
4.2 Univariate Analysis
4.3 Logistic Regression Model
4.4 Linear Discriminant Analysis Model
4.5 Comparison of Logistic Regression and Linear Discriminant Models
4.6 Linear Discriminant Analysis as an Exploratory Step for Logistic Regression
CHAPTER FIVE: DISCUSSION
5.1 Discussion of Results
5.2 Limitations of the Study
CHAPTER SIX: CONCLUSION
REFERENCES

LIST OF TABLES

Table 1: New Data Dictionary of Variables used in Secondary Analysis
Table 2: Demographic Characteristics of the Participants
Table 3: Results of Univariate Analysis to establish association of individual variables with stunting
Table 4: Results of Bootstrapped Logistic Regression
Table 5: Results of Bootstrapped Linear Discriminant Analysis
Table 6: Comparison of Logistic Regression and Linear Discriminant Analysis in terms of Sensitivity, Specificity and Classification Accuracy
Table 7: Comparison of the Two Logistic Regression Models

LIST OF FIGURES

Figure 1: Receiver Operating Characteristics (ROC) curve for Logistic Regression model
Figure 2: Receiver Operating Characteristics (ROC) curve for Linear Discriminant Analysis model
Figure 3: Receiver Operating Characteristics (ROC) curve for the second Logistic Regression model

LIST OF APPENDICES

Appendix A: Joint Research Ethics Committee Approval
Appendix B: Letter of Authorisation to use Data
Appendix C: Logistic Regression and Linear Discriminant Analysis
Appendix D: Zvitambo Questionnaire Used for Data Collection

CHAPTER ONE: INTRODUCTION

1.0 Stunting

Stunting is a well-established child health indicator of chronic malnutrition associated
with biological, environmental and socio-economic factors. It is defined as having a
height/length-for-age more than 2 standard deviations below the median of the
National Center for Health Statistics (NCHS)/World Health Organisation (WHO) growth
reference.[22] Height-for-age reflects the linear growth achieved by the time the
anthropometric measurement is taken. Height is measured in an upright (standing) position,
whilst length is measured in a recumbent position. Stunting is therefore determined from the
height or length of an infant or child together with gender and age. Data for this purpose is
readily available because the measurements are non-invasive and cheap. In 2000, it was
established that 33% (182 million) of the world’s children were stunted and that almost all of
the cases were in developing countries, with 70% occurring in Sub-Saharan Africa and
South Asia.[24,25] According to the Zimbabwe Demographic and Health Survey of 2012, the
prevalence of stunting in Zimbabwe stood at 33%.[31] This research project, among other
objectives, estimated the prevalence of stunting in the Zvitambo cohort, which gives a rough
estimate of the prevalence of stunting in urban Harare at the time of the trial (1997-2000).

Impaired development in children, and later in adults, may largely result from low
birth weight. Low birth weight may be caused by intrauterine growth retardation, which in
turn may be attributed to factors such as maternal under-nutrition, maternal smoking and
infection during gestation, among other causes. Because low birth weight has a long-term
impact on children’s (and adults’) health, there is a need to address the problem through
targeted interventions. Stunting is believed to develop within the first two years of life, and
in most resource-limited countries there is little or no recovery thereafter.[24] Surprisingly,
the global mean length-for-age at birth is very close to the NCHS standard, but growth starts
to deteriorate soon after birth and the deterioration persists well into the third year of
life.[24] It is under the assumption that children do not grow well because they lack the
proper foods in the right quantities[23] that a great deal of research has focused on
identifying dietary solutions for stunting.[26] Assistance in the form of nutritional support
and education, as well as close monitoring of high-risk infants and management of those
affected, is expected to go a long way in curbing the rate of stunting in children. Measuring
the prevalence of stunting at a later stage in life reveals the success of such interventions
delivered to low birth weight infants.[22]

Studies carried out in various countries have aimed at understanding the risk
factors for stunting so as to inform public policy and develop interventions that target
the identified factors. Stunting is of public health importance because it affects child
mortality, a child’s cognitive development and adult economic productivity.[26] Stunted
children tend to be slow learners in school and generally develop into underperforming adults
because stunting causes physical and functional deficits. Because of the public health
impact of stunting, WHO collects prevalence data at national level, standardized in
a systematic way to allow for international comparisons and to identify countries with greater
need for interventions. WHO and the Centers for Disease Control and Prevention (CDC) have
made available free software packages, ANTHRO and Epi Info, for the standardized computation
of Z-scores for nutritional indicators such as height-for-age and weight-for-age.

The data collected by the Zimbabwe Vitamin A for Mothers and Babies
(ZVITAMBO) study is ideal for the evaluation of the risk factors for stunting in children
because the study followed babies from birth to 24 months and recorded a wide variety of
variables. Anthropometric measurements such as length and weight were taken at all follow-up
visits, and gender and age were also recorded, thus enabling the calculation of Z-scores for
height-for-age, which determine whether an infant is stunted or not. Other variables such as
the mother’s age, education level, occupation, housing, number of live births, birth interval,
household income, and the baby’s gestational age and birth weight were also recorded and
constitute the potential risk factors for stunting.
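
The stunting rule defined in Section 1.0 (height/length-for-age more than 2 standard deviations below the reference median) can be illustrated with a minimal Python sketch. It assumes the simple form (observed value minus reference median, divided by reference standard deviation). The function names are hypothetical, and the reference median and standard deviation are placeholder values chosen for illustration only; they are not taken from the NCHS/WHO reference, where they would be looked up by age and sex.

```python
def height_for_age_z(length_cm, ref_median_cm, ref_sd_cm):
    """Number of reference SDs by which a child's length differs from the reference median."""
    if ref_sd_cm <= 0:
        raise ValueError("reference SD must be positive")
    return (length_cm - ref_median_cm) / ref_sd_cm


def is_stunted(z_score, cutoff=-2.0):
    """Stunted when height/length-for-age is more than 2 SD below the reference median."""
    return z_score < cutoff


# Placeholder reference values (assumption: NOT taken from the NCHS/WHO tables).
ref_median_cm, ref_sd_cm = 80.0, 3.0
z = height_for_age_z(73.0, ref_median_cm, ref_sd_cm)
print(round(z, 2), is_stunted(z))
```

In practice the Z-scores would come from a standard package such as ANTHRO or Epi Info rather than from hand-coded reference values.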

1.1 Multivariable Statistical Techniques

Research, particularly medical research, usually focuses on the relationship of an outcome
with multiple covariates, normally possible risk factors. Multivariable statistical techniques
are usually used for such analysis. Examples are multiple linear regression, logistic
regression, Poisson regression and discriminant analysis. Each technique has its own
assumptions and conditions best suited for its use. Multiple linear regression is known to be a
very flexible multivariable regression technique used to analyze relationships between
multiple independent variables and a single continuous dependent variable.[9] Its popularity is
based on its ability to handle all types of independent variables, both continuous and
categorical, but it falls short when it comes to categorical dependent variables.[18]
Discriminant analysis and logistic regression are two widely used multivariable techniques
for analyzing categorical outcomes.[11]

Logistic regression is a type of regression used when the dependent or
outcome variable is binary, discrete or categorical and the predictor or independent variables
are of any kind. It is particularly useful in health sciences, where the dichotomous outcome is
often the presence or absence of some health condition or disease. Unlike linear regression,
logistic regression uses the logit transformation to predict group membership based on
several covariates irrespective of their underlying distribution, and thus avoids predicting
probabilities of group membership that fall outside the range 0 to 1.[28,33] It is especially
important when the outcome has a non-linear (sigmoidal) relationship with the independent
variables.[33] Logistic regression analysis is based on the odds of an outcome, that is, the
probability of having the outcome (or belonging to one group) divided by the probability of
not having the outcome (or not belonging to that group). Discriminant analysis is a similar
classification technique used to determine which set of predictor variables most strongly
discriminates between two or more naturally occurring, mutually exclusive groups. It
estimates orthogonal discriminant functions, which are linear combinations of the
standardised independent covariates that yield the largest mean differences between
groups.[28,33]
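
The odds and the logit transformation described above can be written out as a short worked formula. The notation is introduced here for illustration and is not taken from the source: p denotes the probability that a child is stunted, x_1, ..., x_k the covariates, and beta_0, ..., beta_k the coefficients of the fitted model.

```latex
\[
\text{odds} = \frac{p}{1-p}, \qquad
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k ,
\]
\[
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}} .
\]
```

Because the second expression always lies strictly between 0 and 1, the model cannot return an impossible probability, whatever the values of the covariates.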

Thus the two multivariable techniques are similar in many aspects but also
have distinct differences, which result in logistic regression being more widely used than
linear discriminant analysis. Both methods are useful when the outcome variable is
categorical, and each technique may be used to answer questions for which the other is
designed, though the estimators are calculated using different methods.[4,12] The major
difference between the two techniques is that discriminant estimators are more powerful when
the covariates are normally distributed with equal covariance, an assumption not necessary
with logistic regression. Thus, in circumstances where the normality assumption is not
significantly violated, discriminant analysis and logistic regression may be used to solve the
same problem, making it possible to compare the two techniques on some measures of predictive
accuracy.[11] This research therefore seeks to compare the two methods in predicting stunting
in children using a range of predictor variables.
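
The measures of predictive accuracy named in the abstract (sensitivity, specificity and overall classification rate) can be computed from observed and predicted group membership. The following is a minimal sketch, not the analysis code used in this study; the function name is hypothetical and the labels are made up for illustration (1 = stunted, 0 = not stunted).

```python
def classification_measures(actual, predicted):
    """Return sensitivity, specificity and overall classification rate for 0/1 labels."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have equal length")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # stunted, classified stunted
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # not stunted, classified not stunted
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)

    def ratio(num, den):
        # Undefined when a group is empty; report NaN rather than raising.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Made-up labels for illustration only (assumption: not study data).
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
measures = classification_measures(actual, predicted)
print(measures)
```

The same function can be applied to the predictions of either method, which is what allows the two to be compared on a common footing.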

1.2 Description of the Original Study

Mother-to-child transmission accounts for 90% of all paediatric HIV infections
worldwide. Among breastfeeding populations, vertical transmission rates of 20-40% have
been reported, about a third of which occur during breastfeeding. This risk poses a great
dilemma for most African countries, where 30% of lactating women are HIV positive and
alternative feeding choices are out of reach and would put at risk the numerous infant lives
saved by breastfeeding annually. Thus, to balance the two risks, an intervention was needed
that reduces the infectiousness of the breast milk of HIV positive women and could be applied
universally.

Emerging data has indicated that vitamin A deficiency in HIV positive women is
associated with higher breast milk viral load and higher vertical transmission rates. This
suggests that maternal vitamin A supplementation in the immediate postpartum
period may reduce the risk of transmission during lactation. Vitamin A supplementation in
HIV negative women would also improve the vitamin A status of the mother and her breastfed
infant. In addition, supplementation would reduce the risk of horizontal transmission in
HIV negative women during the postpartum year, when they are at particularly high risk of
infection. Vitamin A supplementation of the neonate would have the additional
benefit of substantially reducing early infant mortality.

1.2.1 Research Primary Objectives

To determine if oral administration of single doses of vitamin A to mothers and

neonates during the immediate postpartum period will reduce:

1. Vertical HIV transmission during lactation by at least 30%,

2. Horizontal HIV transmission among seronegative women by at least 25%, and

3. Infant mortality by at least 30%.

1.2.2 Specific Objectives of the Zvitambo Study

1. To determine if oral administration of a single 400 000 IU dose of vitamin A given
during the immediate post-partum period to HIV seropositive lactating mothers
will reduce HIV transmission via breastfeeding by 30%.

2. To determine if oral administration of a single 400 000 IU dose of vitamin A given
during the immediate post-partum period to HIV seronegative lactating mothers
will reduce their rate of seroconversion during the post-partum year by at least
25%.

3. To determine if oral administration of a single 50 000 IU dose of vitamin A given
to neonates, a single 400 000 IU dose of vitamin A given to lactating mothers or
supplementation of both mother and infant during the immediate post-partum
period will reduce infant mortality by at least 30%.

1.2.2.1 Secondary Objectives:

i. To examine the association between maternal vitamin A status and viral load in

breast milk and plasma in women, and determine if vitamin A supplementation

reduces plasma and breast milk viral load and increases CD4 lymphocytes;

ii. To examine the timing of post-partum vertical HIV transmission and determine

whether maternal-neonatal vitamin A supplementation affects this timing;

iii. To investigate the relationship of acute phase reactants with serum retinol in sick

and healthy children and describe how serum retinol measures might be adjusted

by concurrent acute phase reactant measures to control for the effects of infection.

1.2.3 Subjects, Materials and Methods of the Zvitambo Study

1.2.3.1 Study Design

The study was a 2x2 factorial randomized, double-masked, placebo controlled trial of

14 110 mother-infant pairs randomized to one of four treatment arms:

Treatment    Vitamin A dose (infant)    Vitamin A dose (mother)
I (aA)       50 000 IU                  400 000 IU
II (aP)      50 000 IU                  Placebo
III (pA)     Placebo                    400 000 IU
IV (pP)      Placebo                    Placebo

Randomization was stratified by infant birth weight. In line with the
estimated distribution of all births, 90% of participants were normal birth weight
babies whilst 10% were low birth weight babies. Stratification was done in order to
control for potential confounding, since low birth weight is a risk factor for adverse
health outcomes in babies.
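
Stratified allocation to the four arms of a 2x2 factorial design can be sketched as follows. This is an illustration only: the source does not describe the allocation mechanism, so the use of permuted blocks of four is an assumption, the function names are hypothetical, and the stratum sizes are invented to mirror the 90%/10% split rather than the trial's actual counts.

```python
import random
from collections import Counter

# Arm codes as in the treatment table: first letter = infant (a = vitamin A, p = placebo),
# second letter = mother (A = vitamin A, P = placebo).
ARMS = ["aA", "aP", "pA", "pP"]


def permuted_block_allocation(n_participants, rng):
    """Assign arms in shuffled blocks of four so that arm sizes stay balanced."""
    allocation = []
    while len(allocation) < n_participants:
        block = ARMS[:]
        rng.shuffle(block)
        allocation.extend(block)
    return allocation[:n_participants]


def stratified_allocation(strata_sizes, seed=0):
    """Generate a separate allocation sequence within each birth-weight stratum."""
    rng = random.Random(seed)
    return {stratum: permuted_block_allocation(n, rng)
            for stratum, n in strata_sizes.items()}


# Illustrative stratum sizes (assumption: not the trial's actual counts).
allocation = stratified_allocation({"normal_birth_weight": 360, "low_birth_weight": 40})
for stratum, arms in allocation.items():
    print(stratum, Counter(arms))
```

Allocating within each stratum keeps the proportion of low birth weight babies the same in every arm, which is the purpose of stratification stated above.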

1.2.3.2 Study Area and Target Population

The study area was greater Harare urban, Chitungwiza and Epworth. The

study was health facility-based with the participants being enrolled at any one of the

following health centres: Harare Central Hospital, Chitungwiza Hospital and twelve

City of Harare clinics. The target population was all women of child-bearing age

(15-49 age group) in Harare and Chitungwiza.

1.2.3.3 Study Participants

Mothers who delivered their infants at any one of the fourteen research sites
were eligible to participate if neither the mother nor her infant suffered any
life-threatening complications during delivery, if the infant weighed at least 1 500 grams
at birth, and if the mother intended to stay in Harare for at least two years after
delivery.

1.2.3.4 Sample Size

Sample size calculations, which yielded 14 000 participants for the main trial and
the primary objective of vertical HIV transmission, were based on the following
assumptions:

1. The prevalence of HIV infection among women enrolled in the study was 30% at

baseline.

2. The mother-to-child transmission rate among HIV+ mothers in the control group

was 30% by 24 months and 10% during breast feeding.

3. The seroconversion rate among the HIV- mothers in the control group during the

first year post partum would be 6%.

4. The infant mortality rate in the treatment arm in which both mother and infant

received placebo would be 60 per 1000.

5. The use of two-tailed tests with an overall type I error of 5% and a type II error of

20%.

6. A reduction in total post partum HIV vertical transmission of 20%.

7. A retention rate of 90% of all participants.

Sample sizes for the other two primary objectives and the sub-studies were calculated

based on the main trial sample size and the relevant prevalence and anticipated

reduction rates.
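
How assumptions of this kind feed into a sample size can be illustrated with the standard normal-approximation formula for comparing two proportions. This sketch does not reproduce the trial's figure of 14 000, which also depends on HIV prevalence, the factorial design and details not given here. The inputs (a 10% control rate and a 30% relative reduction) are chosen for illustration from the figures quoted above, and the function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p_control, relative_reduction, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sided comparison of two proportions."""
    p_treated = p_control * (1 - relative_reduction)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed type I error of 5%
    z_beta = NormalDist().inv_cdf(power)           # type II error of 20% = power of 80%
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2)


def inflate_for_retention(n, retention=0.90):
    """Enrol enough participants that the required number remain after losses to follow-up."""
    return ceil(n / retention)


# Illustrative inputs (assumption: not the trial's actual calculation).
n = n_per_group(p_control=0.10, relative_reduction=0.30)
n_enrolled = inflate_for_retention(n)
print(n, n_enrolled)
```

The required size rises sharply as the anticipated reduction shrinks, which is why the anticipated reduction rates had to be fixed in advance for each objective.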

1.2.3.5 Sampling Methods

The 14 110 participants were recruited from Monday to Friday over a period of
18 months and were distributed proportionately between Harare Hospital (60%) and
the City Health Service Clinics (40%), according to the usual distribution of births in
the city. Convenience sampling was used at all fourteen study sites: all mothers
delivering at each of the participating health centres during the 18-month recruitment
period were considered for eligibility. If a woman did not meet the inclusion
criteria or refused to participate, the next mother on the delivery register was
considered.

1.2.3.6 Data Collection Methods

A team of specially trained nurses at each participating site was responsible
for the recruitment process. Following written consent from eligible women, a
standardized questionnaire and transcription of hospital records were used to obtain
baseline information on the mother and infant. Vitamin A supplementation was then
administered according to the randomization plan. At each follow-up visit, a follow-up
questionnaire was used to record information such as history of maternal and infant
illness, feeding practices and anthropometric measurements. Specialized
questionnaires were also used to gather information on adverse health events such as
hospitalization.

The consenting mothers and their infants would also undergo physical

examinations by trained nurses which included taking various measurements such as

infant weight and length. Blood samples were obtained by venipuncture from the

mothers and by heel prick from the infant. Colostrum at baseline and breast milk on

follow-up visits were collected by manual expression. The blood and milk specimens

were subjected to an array of tests including HIV and milk retinol tests in specialized

laboratories by trained personnel.

1.2.3.7 Outcome Measures

The measurable outcomes for the study included HIV status for both mother

and infant taken at all visits to establish interval of infection. Breast milk specimens

were taken to establish vitamin A status and HIV viral load. Vital status of the

mother and the infant was established in order to determine mortality rates in the

cohort. Morbidity history of the pair was also recorded.

1.2.4 Data Management

At enrolment the participating mother and infant pair were allocated an identity code

linked to the capsule packet given to them and at each follow-up visit all questionnaires and

specimens were further coded by a two-digit visit number and a one letter code for specimen

type. Data collected at all research sites was checked for legibility, completeness and

accuracy by an appointed nurse supervisor before being taken to the Data Entry Shop at

Harare Hospital. All study instruments were double-entered by two data entry clerks using

SPSS-DE for Windows and any discrepancies were resolved by referring to the original hard

copies.

1.2.5 Data Analysis

Descriptive analyses of demographic and clinical characteristics of the mothers and

infants were done for the entire study population and by treatment arm to check success of

randomization indicated by random distribution of these characteristics across all arms.

Continuous and categorical baseline characteristics were compared across the 4 treatment

groups using Kruskal-Wallis and χ2 tests, respectively.

The percentage efficacies of maternal and infant vitamin A supplementation in

reducing postpartum vertical transmission, horizontal transmission and infant mortality were

calculated using the Turnbull method. Confidence intervals (95%) were computed for the

efficacy using 2000 bootstrap samples. Exclusive breastfeeding rates were estimated using

Kaplan-Meier methods and compared across the treatment arms by pairwise log-rank tests.

T-test and linear regression models were also used to investigate the effect of vitamin A

supplementation on the quantity of HIV in breast milk and plasma, on CD4 counts, on the

infant vitamin A status during the first year of life and on the association between acute phase

reactants and serum retinol concentrations.
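The bootstrap confidence intervals mentioned above can be illustrated with a percentile interval for a simple proportion. This is only a sketch: the trial applied the bootstrap to Turnbull efficacy estimates, whereas the toy outcome vector, the fixed seed and the function names `bootstrap_ci` and `proportion` below are assumptions made for illustration.

```python
import random

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Resample n observations with replacement and recompute the statistic.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(resample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

def proportion(xs):
    return sum(xs) / len(xs)

# Hypothetical binary outcomes (1 = event), not data from the trial.
outcomes = [1] * 30 + [0] * 70
low, high = bootstrap_ci(outcomes, proportion)
print(f"point estimate {proportion(outcomes):.2f}, 95% CI ({low:.2f}, {high:.2f})")
```

The same function accepts any statistic that maps a resampled list to a number, so an efficacy estimate could in principle be passed in place of `proportion`.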

Logistic regression was used to examine the effect of neonatal and/or maternal

vitamin A supplementation on vertical transmission, horizontal transmission and infant

mortality controlling for selected variables such as maternal age, socioeconomic status, serum

retinol concentration at birth, infant birth weight and nutritional status, among others.

Kaplan-Meier survival curves and Cox proportional hazards regression models were used to

compare the timing of infection of the infant and the timing of death in the mother-infant

pairs who received vitamin A supplementation and those who did not.
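As a reminder of what a Kaplan-Meier curve estimates, the product-limit calculation can be sketched in a few lines. The follow-up times and event indicators below are hypothetical, and the function name `kaplan_meier` is an assumption for illustration; it is not the software routine used in the trial.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time for each subject
    events : 1 if the event was observed at that time, 0 if censored
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up times in months (1 = event, 0 = censored).
times = [3, 6, 6, 9, 12, 12]
events = [1, 0, 1, 1, 0, 0]
km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(t, round(s, 4))
```

Subjects censored at the same time as an event are counted as still at risk at that time, which is the usual convention.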

1.3 Critical Appraisal of the Study

1.3.1 Research Primary Objectives

According to the WHO peri-natal transmission study carried out in Harare, mother-to-child

vertical transmission of HIV accounts for over 90% of all pediatric HIV infections

nationwide and between a quarter and a third of these occur during breastfeeding. Vitamin A

deficiency among HIV seropositive women has been shown to be associated with higher

breast milk viral load and hence higher vertical transmission rates,6,30 suggesting that

maternal Vitamin A supplementation during the immediate postpartum period may reduce the

risk of transmission during breast feeding. This indicates that the primary objectives of the

study were targeting a real problem in the prevention of mother-to-child HIV transmission.

1.3.2 Specific Objectives of the Study

1. The objectives of the study were specific as they clearly stated what was to be done

and for whom the intervention was intended. Mothers and their new born babies

received a specific dose of Vitamin A during the immediate post partum period and

the expected effect was specified.

2. The measurability of the objectives was illustrated by the fact that the anticipated

results were clearly quantified in comparison with the known baseline status. A 30%

reduction in vertical HIV transmission and infant mortality and a 25% reduction in

horizontal transmission in the mothers were expected.

3. The planning stages of the research involved wide reading in the area of interest so as

to learn from the experiences of others. The budget and expert human resources

required for the study were sourced. Thus achievable objectives were set in line with

the secured financial and human resources as well as the predetermined time frame.

4. The objectives of the study were quite relevant considering that at the time the

research was done (1997-2000), strategies to curb the HIV/AIDS pandemic were

of paramount importance to the health of the nation. Thus the objectives were

relevant to the main goal of reducing HIV transmission.

5. The objectives were referring to the postpartum period, with the outcomes of

horizontal transmission, infant mortality and vertical transmission being measured at

one year and two years respectively, thus making the objectives time-bound. The

study participants were followed up for a specific time period, after which the

outcome of interest was measured.

1.3.3 Study Design

The study design used was a 2x2 factorial randomized, double-masked, placebo

controlled trial. Randomized controlled trials are best suited for investigating the effect of

intervention procedures on outcomes such as death or occurrence of disease.17 That makes it

the most appropriate design for studying the effect of a single dose of Vitamin A

supplementation to mothers and their neonates on HIV vertical and horizontal transmission as

well as infant mortality.

Randomized controlled trials are considered to be the ‘gold standard’ of evidence-

based medicine because they are the only known method that significantly minimizes

selection and confounding biases2 and help infer causality by establishing temporal sequence

between exposure and outcome. In this case, bias was further reduced by double-masking,

placebo control and stratifying by birth weight which is a known confounder, thus further

increasing confidence in the study findings. Since the study was investigating two treatments,

that is maternal and neonatal supplementation, the 2x2 factorial design was ideal as it further

allowed for the evaluation of the interaction that may exist between the two treatments17.

1.3.4 Sample Size

The sample size was sufficient for the anticipated effects especially considering the

calculations were done taking into account all the necessary assumptions and the expected

reduction rates. Longitudinal studies have the risk of loss-to-follow up and the sample size

calculations took into account a retention rate of 90% such that the study would retain its

power.

1.3.5 Sampling Methods

Participants for the study were enrolled at Harare and Chitungwiza Hospitals and all

city health service clinics. For the results of the study to be generalizable to the study

population, the participants had to be truly representative of the target population, thus

sampling methods had to be free of selection bias. Convenience sampling was used at all the

recruitment health facilities mainly due to the large numbers that were required for the trial so

every woman delivering at any of the study sites was considered for participation if they met

the inclusion criteria. This type of sampling is inexpensive, easy, fast and subjects are readily

available. However, convenience sampling introduces sampling bias into the study since the

resulting sample is not truly representative of the target population.

The sample for the Zvitambo study was selected only in Harare urban thus excluding

the rural population which has some inherent characteristics that are different from the urban

population and would most likely vary the findings. In addition, the study was health centre

based thus biasing the results to only those mothers with a high health-seeking behavior.

These two factors and the sampling technique used therefore limit the external validity of the

study findings.

1.3.6 Data Collection Methods

Interviewer-administered standardized questionnaires and chart reviews were

appropriately used to collect maternal and infant characteristics. These standardized tools

were appropriate as they managed to collect the same information from all participants in an

almost uniform manner minimizing interviewer bias. The questionnaires also collected

participant locater details such that in case of missed clinic appointments home visits could

be made.

The research objectives involved determining HIV status of both mother and infant,

their vitamin A status and breast milk viral load at regular time points hence the collection of

blood and milk specimens was very appropriate. The use of specially trained nurses and

laboratory personnel in data collection ensured a high quality of specimen collection and

diagnostic testing on which the credibility of the study findings hinged.

1.3.7 Data Analysis

An intent-to-treat analysis was carried out as all 14 110 mother-infant pairs who were

randomized were included in the analysis. This is the classic analytic approach for any

experimental study as it measures the effectiveness of the intervention under everyday

practice conditions, hence was the most ideal for this study. Just as required for any

randomized control trial, the first step was to carry out descriptive analyses of demographic

and clinical characteristics of the entire population and by treatment arm to examine success

of the randomization process and this was appropriately done. Comparison of these

characteristics across treatment arms was done using Kruskal-Wallis and χ2 tests for

continuous and categorical variables, respectively and these were relevant as it was necessary

to compare the distribution of all other characteristics across treatment arms.

The percentage efficacies of maternal and infant Vitamin A supplementation in HIV

transmission and mortality were calculated using the Turnbull method, which was appropriate

for this type of data, where the exact time of the event was not known and only the

interval within which the event occurred was known. To establish the effect of Vitamin A

supplementation on HIV transmission and infant mortality, logistic regression techniques

were used to control for other potential predictors. This technique was used appropriately

considering that the predictor variables were a combination of categorical and continuous

variables, some of which were not necessarily normally distributed.

Survival analysis using the Kaplan-Meier methods and Cox proportional hazards

regression was done. This type of analysis was appropriate since it was necessary to compare

survival in the different treatment arms in terms of HIV infection as well as infant death. It

was of interest to determine the probability of remaining infection-free and the probability of

the infant surviving beyond a specific time period with respect to the treatment arm.

Comparisons of continuous variables such as milk, serum viral load and vitamin A status

across study arms were done using t-test, ANOVA and linear regression. These statistics

were relevant considering that the variables under consideration were continuous and largely

normally distributed with possible calculation of means across the study arms.

1.4 Quality of Data

The integrity of any study is hinged on the quality of the data collected as this

represents the actual study findings. The quality of research data refers to its state of

completeness, relevancy, internal and external validity, consistency, timeliness and accuracy

which makes the data appropriate for a specific use. The critical appraisal of the Zvitambo

study highlighted the measures that were taken during data collection and handling which

ensured that the resulting data was of very high quality. The data was collected using

standardised questionnaires which were administered by trained personnel, and double data entry

was implemented, with physical verification used in cases of any discrepancies. The data was

edited by way of manipulating some variables, cleaned, verified and validated, thus enhancing

its quality. The researcher hence has a very high level of confidence in the quality of the data

used in this research. This confidence is reinforced by the fact that, more than a decade after

the study was completed, the data was still being used to answer different research questions,

including in international meta-analyses.

1.5 Problem Statement

Discriminant analysis answers the question “What is the probability of correctly

classifying an observation”, whilst logistic regression answers “What is the probability of

success given a set of covariates”4. The two questions are so similar that either

technique can be used effectively to answer questions for which the other is designed. That

being the case, it may be worthwhile to study how the predictive accuracy of the two

techniques compare. On the other hand, discriminant function estimators may be used as an

exploratory stage in the process of fitting a logistic regression model.14 In this case it is of

interest to ascertain whether combining the two techniques improves the predictive power of

the resulting logistic regression model.

CHAPTER TWO: LITERATURE REVIEW

2.1 Stunting

Wamani et al (2007) carried out a meta-analysis of 16 demographic and health surveys

of 10 sub-Saharan countries and noted that generally boys tend to be more stunted than girls

in the same age groups of under five years.20 They calculated a pooled estimate of the mean

z-scores which was statistically different for boys and girls and the prevalence of stunting

was also higher among boys than among girls. One of the studies by the same author

revealed that the differential in stunting rates between boys and girls was more pronounced in

the lower socio-economic groups than in the well-to-do groups.27

Studies carried out in several countries to investigate the risk factors for stunting in

children identified a variety of biological, socioeconomic, behavioural and environmental risk

factors. Some of the studies established that different risk factors are at play at different

stages of a child’s development. Studies carried out in the Philippines and Indonesia indicate

that the principal risk factors for stunting below six months of age are maternal behaviours

and child biological characteristics, e.g. breast feeding status, sex and birth weight, whilst after

six months, household socioeconomic status, behavioural and biological characteristics

become important, e.g. father’s education or occupation, age and sex.15,16 Studies done in

Brazil, China, India and Nigeria also noted that sanitation in the area, mother’s age, birth

interval, family size and attendance of public schools were among the risk factors for

stunting.11,13,17,19 Interventions targeted at women and children with special needs were

proposed such as increasing women’s access to education and prenatal care, encouraging

exclusive breastfeeding and family planning and interventions targeted at low birth weight

babies.16

All the studies reviewed used a cross-sectional study design whereby the various

anthropometric measurements were taken and questionnaires were used to collect other

historical information, behaviours and socioeconomic details. Multi-stage stratified sampling

was employed in the selection of participants15,16 mainly to ensure that risk factors were

examined within the same age groups and environmental setting (rural/urban). In the data

analysis, univariate analysis was first done to establish associations between stunting and

each of the risk factors. Logistic regression in all cases was the multivariable technique used

to check which predictor variables were independently associated with the outcome adjusted

for the effect of the other risk factors.

2.2 Logistic Regression and Discriminant Analysis

All the studies reviewed on stunting used logistic regression as the multivariate

technique to determine the major risk factors for stunted growth in children. This is possibly

due to the robustness of the method and the fact that the risk factors being investigated are a mixture of

categorical and continuous variables. A further advantage of logistic regression over

discriminant analysis is that it does not rely on the assumption of normality and

equal variances of the predictor variables, which is hardly ever satisfied in practice.14 However,

when the normality assumption is valid the discriminant analysis estimator is more efficient

than the logistic regression estimator.5,14 Contrary to the popular belief that logistic regression is

limited to two categories for the dependent variable, Hosmer and Lemeshow (1989) have

shown that it can be applied to situations with more than two categories. On the other hand,

discriminant analysis is believed to be applicable only when independent variables are

interval scaled, but research has also revealed that it can be used with both continuous and

categorical data.10,18 These developments set a basis upon which the two multivariable

techniques can be compared in answering the same question using the same data set and

establish if their predictive accuracy is comparable.
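One standard way to see why the two techniques can answer the same question is that, when a single predictor is normally distributed in both groups with a common variance, the group-membership probability implied by discriminant analysis is exactly a logistic function of that predictor. The sketch below checks this numerically; the means, standard deviation and prior are hypothetical values chosen only for illustration, and the function names are assumptions.

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_bayes(x, mu0, mu1, sd, prior1):
    """P(group 1 | x) from Bayes' rule with two normal groups sharing one sd."""
    num = prior1 * normal_pdf(x, mu1, sd)
    den = num + (1 - prior1) * normal_pdf(x, mu0, sd)
    return num / den

def posterior_logistic(x, mu0, mu1, sd, prior1):
    """The same posterior written as a logistic function of x."""
    slope = (mu1 - mu0) / sd ** 2
    intercept = (math.log(prior1 / (1 - prior1))
                 - (mu1 ** 2 - mu0 ** 2) / (2 * sd ** 2))
    return 1 / (1 + math.exp(-(intercept + slope * x)))

# Hypothetical parameter values chosen only for illustration.
mu0, mu1, sd, prior1 = 0.0, 1.5, 1.0, 0.3
for x in (-1.0, 0.5, 2.0):
    print(x,
          round(posterior_bayes(x, mu0, mu1, sd, prior1), 6),
          round(posterior_logistic(x, mu0, mu1, sd, prior1), 6))
```

The two columns agree, so any difference between the fitted models in practice comes from how the coefficients are estimated and from departures from the normality and equal-variance assumptions.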

Press and Wilson (1978) investigated what has to be considered when making a

choice between using logistic regression and discriminant analysis after noting that the two

methods may be used to answer the same research question. They first outlined the

theoretical differences between the two methods which include the underlying assumptions

and the distinct methods of estimating coefficients and then applied the two methods to some

empirical data. The resulting models were then compared and performance of each technique

was determined by a classification rate, in which case logistic regression outperformed

discriminant analysis. This was attributed to the fact that whenever the assumption of normality

of covariates is violated, discriminant analysis performs poorly.14

Pohar et al (2004) compared logistic regression and discriminant analysis using

simulated data and in addition to the classification error rate criteria they also used some

indexes adopted from Harrell and Lee (1985).12 The indexes simply called A, B, C and Q

were noted to be statistically more efficient than the classification error rate as they reveal

how well each model discriminates between the groups and how good the prediction is.29

Antonogeorgos et al (2009) carried out logistic regression and discriminant analyses to

evaluate factors associated with prevalence of asthma among children whereby they intended

to evaluate the divergence and similarity of the two statistical techniques. The study used

cross-sectional anthropometric and lifestyle data from 10-12 year old children in Greece and

related them to the presence of asthmatic symptoms. Logistic regression and discriminant

analyses produced similar models upon comparison of the sign and magnitude of coefficients and the

area under the Receiver Operating Characteristic (ROC) curve, which summarises

classification accuracy by plotting sensitivity against 1 − specificity of the model.28

Montgomery et al (1987) compared the two techniques using two data sets on the

predictor variables of coliform mastitis on dairy cows. Their study revealed that the methods

identified the same variables as important predictors and were equally useful in classifying

cows as diseased or not diseased though logistic regression had fewer classification errors.

Comparisons were done with respect to the variables selected, order of selection, sign and

magnitude of the variable coefficients, specificity, sensitivity and overall correct

classification rate investigated at varying probability cutoff points. Receiver Operating

Characteristic (ROC) curves for the two models were compared on the same axes and

revealed that logistic regression had a better classification ability than discriminant analysis,

that is, it correctly classifies more cases than does discriminant analysis.32 Panagiotakos

(2006) set out to compare logistic regression and discriminant analysis in the prediction of in-

hospital mortality of patients admitted with Acute Coronary Syndrome. Like Montgomery et

al (1987), he compared the selected variables, magnitude and sign of the coefficients,

specificity, sensitivity, classification rate and ROC curves. For statistical generalization, a

non-parametric bootstrap technique was used to estimate both logistic regression and

discriminant analysis estimates. He concluded that the two methods resulted in the same

model but logistic regression gave a better classification rate.33
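The comparison criteria used in these studies (sensitivity and specificity at a chosen probability cutoff, and the area under the ROC curve) can be computed as sketched below. The outcome vector and the two sets of predicted probabilities are hypothetical and merely stand in for the output of a logistic regression and a discriminant analysis; the function names are assumptions for illustration.

```python
def sensitivity_specificity(y_true, scores, cutoff):
    """Classify as positive when score >= cutoff, then compare with truth."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= cutoff)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < cutoff)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < cutoff)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive case scores higher than a random negative case (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((1.0 if p > n else (0.5 if p == n else 0.0))
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes and predicted probabilities from two models.
y = [1, 1, 1, 0, 0, 0, 0, 1]
model_a = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
model_b = [0.6, 0.7, 0.3, 0.4, 0.5, 0.2, 0.1, 0.8]

for name, scores in (("model A", model_a), ("model B", model_b)):
    sens, spec = sensitivity_specificity(y, scores, cutoff=0.5)
    print(name, "sensitivity", sens, "specificity", spec, "AUC", auc(y, scores))
```

Repeating the first function over a grid of cutoffs gives the points of the ROC curve, which is how the classification rates at varying cutoff points relate to the single AUC summary.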

The studies reviewed indicate that in the majority of situations logistic regression and

discriminant analysis produce models that converge, provided all statistical assumptions for

the two techniques are satisfied. Logistic regression, however, becomes preferable given the

usual failure to meet the assumptions of equal covariance and multivariate normality as it

tends to produce more stable estimates and can handle data on any measurement scale.

2.3 Research Questions

1. What are the prevalence and risk factors for stunting in the infants enrolled for the

Zvitambo study?

2. What is the relative predictive accuracy of discriminant analysis and logistic regression

(which analytical method is more reliable in classification of subjects into categories)?

3. Can discriminant analysis be used as an exploratory stage in the fitting of a logistic

regression model in order to enhance predictive power?

2.4 Justification Of The Study

From a review of the literature, discriminant analysis and logistic regression can

apparently be used to answer the same research question although their solutions may be

fundamentally different14. Against this backdrop, it is cause for concern that research hardly

employs discriminant analysis as an analytic technique, and yet it may be a reliable technique

with low classification error when required to predict membership to a given category using a

set of explanatory variables. The proposed analysis aims to establish whether the two

techniques have the same predictive accuracy and therefore utility or whether they are

complementary to the effect that their use in combination rather than individually improves

predictive power. Some studies that compared the two statistical techniques used simulated

data sets, so it would be worthwhile to compare the two using real research data.

2.5 Research Objectives

1. To determine the prevalence and risk factors for stunting in the Zvitambo cohort.

2. To compare the predictive accuracy of logistic regression and discriminant analysis in

the prediction of stunting in children.

3. To determine whether discriminant analysis can be used as a preliminary exploratory

step to logistic regression to improve prediction power.

CHAPTER THREE: METHODOLOGY

3.1 Description of Data

The data that was used in this research was collected by the ZVITAMBO study –

“Vitamin A supplementation of breast feeding mothers and their neonates at delivery: Impact

on mother-to-child HIV transmission during lactation, HIV infection among women during

the post-partum year, and infant mortality.” This was a two year follow-up study carried out

in Harare with enrolment of mothers and their infants taking place within 96 hours of

delivery and follow-up visits at 6 weeks and at 3 monthly intervals thereafter, with the first

recruitment done in 1997 and the last follow-up in 2000. Maternal and infant characteristics,

blood and breast milk specimens were collected at baseline and follow-up visits at Harare and

Chitungwiza hospitals and at twelve city health service clinics. Maternal variables measured

included among others, age, level of education, nutritional status, employment status,

monthly income, birth interval, knowledge of feeding practices, morbidity history and their

partner’s level of education and employment status. Infant characteristics that were collected

included gestational age, delivery method, sex, length, weight, feeding practices and

morbidity history. These variables constitute the covariates that were used in modeling

stunting in the infants using logistic regression and linear discriminant analysis.

The data that was used is longitudinal in nature with observations being recorded at 3-

monthly regular intervals, thus the outcome of interest, stunting may be measured at different

stages of the infant’s development. For the purposes of this secondary data analysis, stunting

was considered at 12 months. Thus a cross sectional analysis at that particular time interval

was done as literature has revealed that different factors may be associated with stunting at

the different stages of child development.15

3.2 Sample Size

The original study required a minimum sample size of 14000 but enrolled a total of

14110 participants thus ensuring that the study maintained its anticipated power regardless of

possible missingness. The sample size used for this analysis was 9555, which was the total

number of observations in the Zvitambo data set less those participants who had no

anthropometric measurements at 12 months and any other missing variables at baseline. This

sample size was deemed adequate for the purposes of this research based on three

components, namely required sample size for applying the two analytical techniques, logistic

regression and discriminant analysis as well as the cross-sectional design adopted in handling

the application data.

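Using Dobson’s formula for the calculation of sample size for a cross sectional

study design, a minimum sample size of 340 was calculated for a precision of 0.05, a 95%

confidence level (z = 1.96) and a stunting prevalence of 33% (from the 2010-11 ZDHS Report),

that is:

$$ n = \frac{z^{2}\,p\,(1-p)}{d^{2}} = \frac{1.96^{2} \times 0.33 \times 0.67}{0.05^{2}} = 339.7511 $$

where p is the expected prevalence and d is the precision.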
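The figure of 339.7511 quoted above is consistent with the usual single-proportion formula n = z²p(1 − p)/d², and the sketch below assumes that this is the formula intended. The function name `cross_sectional_sample_size` is hypothetical; the inputs are those stated in the text.

```python
import math

def cross_sectional_sample_size(prevalence, precision, z=1.96):
    """n = z^2 * p * (1 - p) / d^2 for estimating a single proportion."""
    return z ** 2 * prevalence * (1 - prevalence) / precision ** 2

n_exact = cross_sectional_sample_size(prevalence=0.33, precision=0.05)
n_required = math.ceil(n_exact)  # round up to whole participants
print(round(n_exact, 4), n_required)
```

Rounding up rather than to the nearest integer guarantees that the stated precision is met.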

Discriminant analysis requires that the minimum sample size be at least five times the

number of categories with at least 20 cases per category28, thus giving a minimum sample

size of 40 since there were two categories. Logistic regression is not very sensitive to sample

size such that a sample sufficient for the study design employed would be adequate for the

use of the technique. Therefore, a sample of 9555 was adequate for the purposes of this

project, as it was well above 340, the highest of the minimum sample sizes required by any of

the three criteria. One of the objectives of the secondary data analysis was to estimate the

prevalence of stunting in the Zvitambo cohort, hence the use of 9555 observations which

constituted all the infants in the study who had anthropometric measurements at 12 months.

3.3 Secondary Data Analysis Variables

The Zvitambo study collected a wide range of variables, up to two hundred in total

(recruitment and follow up variables), some of which were not relevant to this research

analysis hence the relevant variables had to be identified. The identification process was

aided by a review of related literature on stunting which gave an indication of potential risk

factors. The outcome measure for this study is stunting but this was only determined after

some calculations using infant length, age and sex. The risk factors extracted from the

original data were mode of delivery, sex of infant, birth weight, gestational age, maternal age,

maternal and partner’s occupation, maternal housing, maternal and partner’s years of formal

schooling, family income, birth interval, number of live births, infant morbidity history and

breast feeding status. A new variable dictionary was produced for this research as illustrated

in Table 1.

3.4 Data Management

The variables identified were used to identify the appropriate tables in the database.

Dbase Plus 8 was used in the Zvitambo study but this research used Stata 12, hence the data

used in the analysis was imported from Dbase to Stata 12. Some of the variables were not in

a format suitable for analysis so the data had to be cleaned. This involved dropping

participants who had missing values in their data, recoding some variables,

generating new variables by manipulating existing ones and merging several data sets. The main

outcome variable, stunting was not available explicitly in the data so it had to be created by

using the WHO Anthropometric Calculator which requires the length/height, age and gender

of the infant to calculate the height-for-age Z-scores. Classification as either stunted or not

stunted was done by considering whether the Z-score was below -2 standard

deviations (stunted) or not (not stunted). The original data sets were maintained so as to provide back-up

copies.
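The classification step described in this section can be sketched in a few lines of Python. This is an illustration only and not the WHO Anthropometric Calculator: it assumes the LMS form of the Z-score and the conventional cut-off of -2 standard deviations, and the reference values L, M and S used in the example are hypothetical placeholders, not values taken from the WHO tables.

```python
def lms_z_score(length_cm, L, M, S):
    """Z-score under the LMS method: ((x / M)**L - 1) / (L * S), for L != 0."""
    if L == 0:
        raise ValueError("this sketch covers only the case L != 0")
    return ((length_cm / M) ** L - 1.0) / (L * S)


def is_stunted(haz, cutoff=-2.0):
    """Classify as stunted when the height-for-age Z-score is below the cut-off."""
    return haz < cutoff


# Hypothetical reference values for one age/sex group (NOT from the WHO tables).
L_REF, M_REF, S_REF = 1.0, 75.0, 0.03

haz_example = lms_z_score(69.0, L_REF, M_REF, S_REF)
stunted_example = is_stunted(haz_example)
```

With these placeholder values an infant measuring 69 cm falls below the cut-off and is classified as stunted, whereas a Z-score of exactly -2 is classified as not stunted.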

3.5 Statistical Analysis

Descriptive analysis of the data was done to give a general description of the

demographic characteristics of the study participants. Univariate analysis was carried out in

order to establish which independent variables were individually associated with stunting.

The chi-square test of independence was used for categorical variables whilst the Kruskal-Wallis

test was used for the continuous ones. The two statistical methods, logistic regression and

discriminant analysis, were then applied so that their classification and predictive

abilities could be compared.

In order to validate the variables that are really important in the model and to allow

for statistical generalization a non-parametric bootstrap estimation procedure32 with 20

samples from the main data set was used for both logistic regression and discriminant

analysis. Stepwise logistic regression and canonical discriminant function analysis were

performed on a randomly selected part (50%) of the data set using Stata 12. The

repeated sampling and estimation allow for validation of the models. For logistic

regression the significance levels for entry and removal were set at 0.20 and 0.30 respectively

and the variables retained in more than 60% of the models were noted as important risk

factors. For discriminant analysis, variables with a correlation of at least 0.3 in absolute value in the

canonical structure were retained as important covariates in predicting stunting. The means

of the coefficients obtained from the bootstrap samples were taken as the best estimates for

the two models. For the logistic regression parameter estimates, the corresponding bootstrap

confidence intervals for the means were also calculated.
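The way the bootstrap output is summarised can be illustrated with a short Python sketch. It does not reproduce the Stata analysis: the selections and coefficients below are hypothetical, only five runs are shown instead of twenty, and the interval is assumed to be a normal-approximation interval for the mean, since the text does not state which bootstrap interval was used.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def retention_frequency(selections):
    """Share of bootstrap models in which each variable was retained."""
    counts = Counter(var for chosen in selections for var in set(chosen))
    return {var: n / len(selections) for var, n in counts.items()}


def important_variables(selections, threshold=0.60):
    """Variables retained in MORE than `threshold` of the models."""
    freq = retention_frequency(selections)
    return sorted(var for var, share in freq.items() if share > threshold)


def mean_with_ci(values, z=1.96):
    """Mean of the bootstrap coefficients with an assumed normal-approximation CI."""
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


# Hypothetical results of five bootstrap runs (variable names are illustrative).
selections = [
    ["gender", "birth_weight", "income"],
    ["gender", "birth_weight"],
    ["gender", "birth_weight", "income", "mom_age"],
    ["gender", "birth_weight", "income"],
    ["gender", "birth_weight"],
]
gender_coefs = [-0.80, -0.85, -0.82, -0.86, -0.84]  # hypothetical coefficients

important = important_variables(selections)
gender_mean, gender_lo, gender_hi = mean_with_ci(gender_coefs)
```

In this toy example income is retained in exactly 60% of the runs and is therefore not flagged, because the rule requires retention in more than 60% of the models.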

Table 1: New Data Dictionary of Variables used in Secondary Analysis

Variable Name   Variable Definition            Variable Type   Coding
PTID            Participant ID                 Nominal
a05             Mode of delivery               Nominal         1=Normal vaginal; 2=Breech vaginal; 3=Emergency C-section; 4=Elective C-section; 5=Forceps; 6=Vacuum
a14             Gender                         Nominal         0=Male; 1=Female
a16             Birthweight (grams)            Continuous
ga_clc          Gestational age (days)         Continuous
mom_age         Mother’s age (years)           Continuous
edu_mom         Mother’s years of education    Continuous
a20             Maternal occupation            Nominal         1=Domestic/Unskilled worker; 2=Skilled Manual; 3=Clerical; 4=Professional; 5=Vendor; 6=Unemployed (housewife); 7=Other
edu_partner     Partner’s years of education   Continuous
a23             Partner’s occupation           Nominal         1=Domestic/Unskilled worker; 2=Skilled Manual; 3=Clerical; 4=Professional; 5=Vendor; 6=Unemployed (housewife); 7=Other
a25             Maternal housing               Nominal         1=Own; 2=Rented; 3=Lodge; 4=Extended family; 5=Employer provided; 6=Other
a26_std         Household income (Z$)          Continuous
a44             Number of live births          Continuous
birth_int       Birth interval                 Continuous

Several diagnostic tests were performed on the logistic regression model and the

discriminant analysis model to check for validity of assumptions. Stepwise logistic

regression has an in-built mechanism to test for multicollinearity and drops any variables that

would distort the validity of the model. Overall fit of the logistic regression model was tested

using the likelihood ratio test, with a p-value < 0.05 representing statistical significance. Overall

significance of the discriminant function was tested using the Wilks’ lambda test, a

multivariate normality test was used to check for normality of the covariates and a

multivariate covariance test was used to test for equal covariance matrices.

The resulting models were used for classification, varying the cut-off points or prior

probabilities and noting how the sensitivity, specificity and overall classification rate (total

correct classification percentage) vary at each cut-off point. After investigating the models’

predictive abilities at different cut-off points, the summary statistics were compiled and

compared. The comparisons were based on the following aspects: the variables selected in

the models, the sign and magnitude of the coefficients, the sensitivity and specificity of the

classifications, and the overall classification rate at varying cut-off probabilities.32 The Receiver

Operating Characteristic (ROC) curves were also compared to determine which of the two

models enclosed a larger area, indicating a better classification ability. The ROC curve plots

sensitivity (rate of true positives) against 100 minus specificity (rate of false positives) at several

cut-off points and so provides a quick graphical assessment of the effect of varying the

cut-off point in any classification model.
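The quantities compared in this paragraph can be made concrete with a minimal Python sketch. It assumes that a subject is classified as stunted when the predicted probability exceeds the cut-off and that the area under the ROC curve is obtained by the trapezoidal rule; the probabilities and outcomes in the example are hypothetical and are not taken from the Zvitambo data.

```python
def rates_at_cutoff(probs, labels, cutoff):
    """Sensitivity, specificity and accuracy when prob > cutoff means 'stunted'."""
    tp = sum(1 for p, y in zip(probs, labels) if p > cutoff and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p <= cutoff and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, tn / negatives, (tp + tn) / len(labels)


def roc_auc(probs, labels):
    """Area under the ROC curve by the trapezoidal rule over all distinct scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for c in sorted(set(probs), reverse=True):
        # Scores >= c are treated as positive, giving one ROC point per distinct score.
        tp = sum(1 for p, y in zip(probs, labels) if p >= c and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= c and y == 0)
        points.append((fp / negatives, tp / positives))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical predicted probabilities and true outcomes (1 = stunted).
probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

sens, spec, acc = rates_at_cutoff(probs, labels, 0.5)
auc = roc_auc(probs, labels)
```

Raising the cut-off in this sketch lowers sensitivity and raises specificity, which is the trade-off that the ROC curve summarises in a single plot.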

To check whether the two methods can be used in combination to improve

classification accuracy, discriminant analysis was applied as an exploratory step to identify

those variables that strongly discriminate stunted children from their non-stunted

counterparts. These covariates were then subjected to logistic regression and the resulting

model was used in the prediction of stunting in the validation set and predictive accuracy was

compared to the results of the first logistic regression model.

3.6 Ethical Considerations

Approval to carry out the research was sought from and granted by the Joint Research

Ethics Committee (approval letter in Appendix A). Permission to use the data was granted by

Zvitambo Institute of Maternal and Child Health (letter of authorization in Appendix B). The

Zvitambo trial was approved by the Medical Research Council of Zimbabwe (MRCZ).

CHAPTER FOUR: RESULTS

4.1 Demographic Characteristics of the Participants

Of the 9555 participants included in this study, 51% were males and the overall

prevalence of stunting in this Zvitambo cohort at 12 months of age was 22%, thus lower than

the national prevalence of 33%. Of the stunted infants, 62% were males indicating males tend

to be more stunted than females. Most of the babies were full term, with a mean gestational

age of 275 days (standard deviation 10 days), had normal birth weights (mean 2992 grammes,

standard deviation 449 grammes) and had a normal vaginal delivery (89%).

Almost all the babies (98%) were breast fed for some time within the two-year follow-up. It

is also evident that the majority of the infants were born into low socioeconomic households

with unemployed mothers (82%), partners who did skilled or unskilled manual jobs (66%),

a median monthly family income of US$84.23 and, for many, lodged accommodation

(59%). The majority of mothers and their partners were literate, having completed on

average about 10 and 11 years of formal education respectively (Table 2).

4.2 Univariate Analysis

Table 3 shows results of the univariate analysis performed to investigate whether the

predictor variables have any individual association with the outcome, stunting. The following

factors were found to be significantly related to stunting (p-value < 0.05): sex, birth weight,

gestational age, mother and partner’s education, partner’s occupation and having fever.

Table 2: Demographic Characteristics of the Participants

Characteristic            Category               Summary Statistic

Categorical Variables                            Frequency N (%)
Infant Gender             Male                   4909 (51.38)
                          Female                 4646 (48.62)
Delivery mode             Normal vaginal         8427 (88.93)
                          Breech vaginal         122 (1.29)
                          Emergency C-section    597 (6.30)
                          Elective C-section     210 (2.22)
                          Forceps                2 (0.02)
                          Vacuum                 118 (1.25)
Breastfeeding             Ever breast-fed        9365 (98.05)
                          Never breast-fed       186 (1.95)
Maternal Housing          Own                    1017 (10.65)
                          Rented                 141 (1.48)
                          Lodge                  5632 (59.00)
                          Extended family        2156 (22.59)
                          Employer-provided      541 (5.67)
                          Other                  58 (0.61)
Maternal Occupation       Domestic/Unskilled     502 (5.26)
                          Skilled manual         427 (4.48)
                          Clerical               148 (1.55)
                          Professional           109 (1.14)
                          Vendor                 273 (2.86)
                          Unemployed             7864 (82.48)
                          Other                  212 (2.22)
Partner Occupation        Domestic/Unskilled     2526 (26.50)
                          Skilled manual         3806 (39.93)
                          Clerical               624 (6.55)
                          Professional           875 (9.18)
                          Vendor                 708 (7.43)
                          Unemployed             285 (2.99)
                          Other                  577 (6.05)
                          Don’t know             130 (1.36)
*Stunting-12 months       Stunted                1970 (21.58)
                          Not stunted            7160 (78.42)

Continuous normal                                Mean ± SD
Birth weight (grammes)                           2992.386 ± 449.3237
Gestational Age (days)                           275.0207 ± 9.8941
Mom’s education (years)                          9.9130 ± 2.1823
Partner’s education (years)                      11.1276 ± 2.0930

Continuous non-normal                            Median (Q1; Q3)
Birth interval (years)                           3.9055 (2.7652; 5.5305)
Mom’s age (years)                                23.6468 (20.5969; 27.6222)
Number of Live births                            2 (1; 3)
Family Income (USD)                              84.23 (54.62; 141.54)

Table 3: Results of Univariate Analysis to establish association of individual variables with stunting.

Variable                  Category / statistic   Stunted (stats)   Non-stunted (stats)   p-value
Sex+                      Males                  1225              3476                  <0.001**
                          Females                745               3684
Delivery Mode+            Vaginal                1762              6407                  0.957
                          Non-vaginal            208               753
Mom’s Occupation+         Skilled                122               534                   0.124
                          Unskilled              168               570
                          Unemployed             1633              5879
Partner’s Occupation+     Skilled                1006              4037                  <0.001**
                          Unskilled              744               2361
                          Unemployed             65                212
Maternal Housing+         Own/Rented             324               1281                  0.136
                          Other                  1646              5879
Breastfeeding+            Ever breastfed         1925              7026                  0.264
                          Never breastfed        44                132
Birth Weight*             Grammes (mean)         2775.23           3055.94               <0.001**
Birth Interval*           Years (median)         3.72              3.95                  0.2117
Gestational Age*          Days (mean)            272.52            275.72                <0.001**
Mom’s Age*                Years (median)         23.54             23.69                 0.1624
Mom’s Education*          Years (mean)           9.59              9.99                  <0.001**
Partner’s Education*      Years (mean)           10.88             11.19                 <0.001**
Family Income*            USD (median)           74.00             85.96                 0.0579
Number of Live Births*    (median)               2                 2                     0.0905

Note: +χ2, *Kruskal-Wallis, **significant at p<0.05
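As a worked illustration of the chi-square test reported in Table 3, the following Python sketch recomputes the statistic for the sex row from the counts shown in the table. It assumes the uncorrected Pearson statistic for a 2x2 table with one degree of freedom; whether the original Stata analysis applied any correction is not stated, so the sketch is a plausibility check rather than a reproduction.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value


# Sex row of Table 3: males (stunted, non-stunted), females (stunted, non-stunted).
chi2, p = chi_square_2x2(1225, 3476, 745, 3684)
```

Under these assumptions the statistic is roughly 115 on one degree of freedom, which is consistent with the reported p-value of <0.001.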

4.3 Logistic Regression Model

Table 4 shows the variables that were identified by logistic regression analysis. After

carrying out the bootstrap estimation, those variables that were retained in more than 60% of

the models were considered to be important in the prediction of stunting in children. These

factors were gender, birth weight, household income, birth interval and the mother’s level of

education. The coefficients of all five variables are negative, indicating a protective

effect against stunting (for gender, coded 0=male and 1=female in Table 1, this means that

females were less likely to be stunted). The 95% confidence intervals do not cross zero, thus indicating that

all these retained factors are significant, and they are relatively narrow, implying precise

estimates.

Table 4: Results of Bootstrapped Logistic Regression

Variable              Coefficient    95% CI
Gender                -0.83439       -0.8733821; -0.7953979
Birth Weight          -0.001585      -0.0016403; -0.0015297
Household Income      -0.0001021     -0.0001153; -0.0000889
Birth Interval        -0.0493        -0.0622183; -0.0363817
Mother’s Education    -0.0537154     -0.065569; -0.0418618
Constant              5.7025         5.0328; 6.3721

The likelihood ratio test (p-value < 0.001) indicates overall statistical significance of the

model; however, the pseudo R2 value of 0.1 shows that the model only accounts for about 10%

of the variability in stunting.
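The coefficients in Table 4 are on the log-odds scale, so their size is easier to read after exponentiation. The Python sketch below converts the reported coefficients and interval bounds into odds ratios. It assumes that each coefficient refers to a one-unit change in the variable as coded in Table 1 (for example gender coded 0=male, 1=female, and birth weight in grammes); the variable names in the dictionary are illustrative labels, not the names used in the data set.

```python
from math import exp

# Mean bootstrap coefficients and 95% CI bounds as reported in Table 4.
TABLE_4 = {
    "gender": (-0.83439, -0.8733821, -0.7953979),
    "birth_weight": (-0.001585, -0.0016403, -0.0015297),
    "household_income": (-0.0001021, -0.0001153, -0.0000889),
    "birth_interval": (-0.0493, -0.0622183, -0.0363817),
    "mother_education": (-0.0537154, -0.065569, -0.0418618),
}


def odds_ratio(coefficient, units=1.0):
    """Odds ratio for a change of `units` in the predictor: exp(units * beta)."""
    return exp(units * coefficient)


# (odds ratio, lower bound, upper bound) for a one-unit change in each variable.
odds_ratios = {
    name: tuple(odds_ratio(value) for value in triple)
    for name, triple in TABLE_4.items()
}

# Birth weight is recorded in grammes, so a 100 g difference is more interpretable.
or_per_100g = odds_ratio(TABLE_4["birth_weight"][0], units=100)
```

Under these assumptions the odds ratio for gender is about 0.43 and the odds ratio per additional 100 g of birth weight is about 0.85, both below one and therefore in line with the protective effects described in the text.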

4.4 Linear Discriminant Analysis Model

Table 5 shows the variables that linear discriminant analysis identified as important

risk factors for stunting. A variable is considered important in the

discriminatory model if its correlation with the linear discriminant function (canonical

structure) is greater than or equal to 0.3 in absolute value. The three factors that were identified

as having significant discriminatory power were gender, birth weight and gestational age. All

three factors were protective against stunting as they all have negative coefficients. The

linear discriminant function was statistically significant (p-value<0.001 for the F statistic

derived from the Mahalanobis distance), thus the two categories of stunting were statistically

different. However, both the multivariate test for normality and the test for equal covariances were

significant (p-value < 0.001), implying that the two assumptions were violated. Normality

transformations were applied to the covariates but yielded no significant change.

Table 5: Results of Bootstrapped Linear Discriminant Analysis

Variable           Standardised Coefficient    Canonical Structure
Gender             -0.50229                    -0.33464
Birth Weight       -0.86139                    -0.80483
Gestational Age    -0.08688                    -0.38873
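The selection rule applied to the canonical structure can be written as a simple filter. The Python sketch below applies the 0.3 threshold to the correlations reported in Table 5; the fourth entry is a hypothetical covariate added only to show how a weakly correlated variable would be excluded.

```python
def discriminating_variables(structure, threshold=0.3):
    """Keep variables whose absolute canonical structure correlation is >= threshold."""
    return sorted(name for name, r in structure.items() if abs(r) >= threshold)


# Canonical structure correlations from Table 5, plus one hypothetical variable.
structure = {
    "gender": -0.33464,
    "birth_weight": -0.80483,
    "gestational_age": -0.38873,
    "hypothetical_covariate": 0.12,  # illustrative only, not from the study
}

selected = discriminating_variables(structure)
```

The absolute value is used because a strong negative correlation discriminates between the groups just as well as a strong positive one.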

4.5 Comparison of Logistic Regression and Linear Discriminant Models

After fitting the two models, they were compared on the basis of the variables

selected, the sign and magnitude of the coefficients, sensitivity, specificity, overall

classification accuracy and the areas enclosed under their respective ROC curves. Thus the

cut-off or probability points were varied from 0.1 to 0.9 and the resulting attributes at each

point were recorded and illustrated in Table 6. The cut-off points are such that if the

probability of being stunted is less than or equal to that value the individual is classified as

not stunted, otherwise they would fall in the stunted category.

With reference to Tables 4 and 5, the variables selected by the two models have two

variables only in common, namely gender and birth weight. Logistic regression in addition to

the two, also identified household income, birth interval and mother’s education as important

predictors of stunting whilst linear discriminant analysis picked gestational age as the other

important variable. Both methods yielded negative coefficients for all the factors identified as

important thus implying protective effects. The magnitude of the coefficients of the

common variables, gender and birth weight were different, thus indicating different degrees

of discriminatory or predictive power in each of the models.

Table 6: Comparison of Logistic Regression and Linear Discriminant Analysis in terms of Sensitivity, Specificity and Classification Accuracy

Cut-off    Logistic Regression                      Linear Discriminant Analysis
Point*     Sensitivity  Specificity  Accuracy       Sensitivity  Specificity  Accuracy
0.1        94.7         21.2         37.5           100          0.52         22.05
0.2        70.59        59.37        61.86          98.72        6.99         26.84
0.3        44.86        82.16        73.89          93.69        22.41        37.83
0.4        23.68        92.98        77.61          81.83        44.12        52.28
0.5        13.19        97.44        78.76          62.83        65.33        64.79
0.6        5.51         99.14        78.37          41.58        83.37        74.33
0.7        1.95         99.88        78.16          22.38        93.37        78.01
0.8        0.11         100          77.85          8.68         98.24        78.86
0.9        0.00         100          77.82          1.44         99.84        78.55

*P(stunting): scores greater than the cut-off point are classified as stunted, whilst those less than or equal to the cut-off point are classified as not stunted.

Comparing the two models, it was noted that at all cut-off points the linear

discriminant model has a higher sensitivity (proportion of stunted children correctly

classified as stunted) than the logistic regression model whilst the logistic regression model has

higher specificity (proportion of non-stunted children correctly classified as not stunted) at all levels. In

terms of the overall classification accuracy rate, the logistic regression model performed

better than the linear discriminant model at most of the cut-off points except at the 0.8 and

0.9 levels where the linear discriminant model performed slightly better than its counterpart.

When both methods perform at their maximum classification accuracy (logistic-78.76%,

discriminant-78.86%), they have an extremely high specificity (logistic-97.44%,

discriminant-98.24%) and low sensitivity (logistic-13.19%, discriminant-8.68%), implying

that both models would be very good at identifying subjects without the condition. Though

attained at different cut-off points, the two methods have almost the same maximum

classification accuracy rate (about 79%) which is classified as acceptable discrimination according

to Hosmer and Lemeshow (2000).
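The link between sensitivity, specificity and overall accuracy in Table 6 can be checked with a little arithmetic, since accuracy is the prevalence-weighted average of the two. The Python sketch below is only a consistency check under that identity: it backs out the prevalence of stunting implied by one row of each model and then recomputes the accuracy of another row. The prevalence values it produces are derived quantities, not figures reported in the dissertation.

```python
def accuracy(sensitivity, specificity, prevalence):
    """Overall correct classification rate (%) from sensitivity and specificity (%)."""
    return prevalence * sensitivity + (1.0 - prevalence) * specificity


def implied_prevalence(sensitivity, specificity, overall_accuracy):
    """Solve accuracy = p * Se + (1 - p) * Sp for the prevalence p."""
    return (overall_accuracy - specificity) / (sensitivity - specificity)


# Rows of Table 6 as (sensitivity, specificity, accuracy), all in percent.
logistic_09 = (0.00, 100.0, 77.82)
logistic_05 = (13.19, 97.44, 78.76)
lda_01 = (100.0, 0.52, 22.05)
lda_05 = (62.83, 65.33, 64.79)

p_logistic = implied_prevalence(*logistic_09)
p_lda = implied_prevalence(*lda_01)
check_logistic = accuracy(logistic_05[0], logistic_05[1], p_logistic)
check_lda = accuracy(lda_05[0], lda_05[1], p_lda)
```

Both implied prevalences come out at about 22%, close to the 21.58% in Table 2, and the recomputed accuracies agree with the tabulated values to within rounding. The check also shows why the maximum accuracy of about 79% coincides with very low sensitivity: classifying almost every child as not stunted is already correct for roughly 78% of the sample.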

The area enclosed under the ROC curve represents the classification ability of a

model, hence when comparing the ROC curves for the logistic regression model and the

linear discriminant model in Figures 1 and 2 respectively, it was noted that logistic regression

had a superior classification ability as it enclosed an area of 0.7150 compared to 0.7004 for

the linear discriminant ROC curve. However, the difference between the two areas is small

(0.0146).

[ROC plot: Sensitivity (vertical axis) against 1 - Specificity (horizontal axis). Area under ROC curve = 0.7150]

Figure 1: Receiver Operating Characteristics (ROC) curve for Logistic Regression model.

[ROC plot: Sensitivity (vertical axis) against 1 - Specificity (horizontal axis). Area under curve = 0.7004; se(area) = 304.0297 as printed]

Figure 2: Receiver Operating Characteristics (ROC) curve for Linear Discriminant Analysis model.

4.6 Linear Discriminant Analysis as an Exploratory Step for Logistic Regression

Linear discriminant function analysis classifies a subject by calculating its

discriminant score as a linear combination of the strongly discriminating covariates. Thus

using these identified variables in a logistic regression model could possibly improve

classification ability, hence the three factors in the linear discriminant model were fed into a

logistic regression. The resulting model was compared to the initial logistic regression model.

[ROC plot: Sensitivity (vertical axis) against 1 - Specificity (horizontal axis). Area under ROC curve = 0.7024]

Figure 3: Receiver Operating Characteristics (ROC) curve for the second Logistic Regression model.

Table 7: Comparison of the Two Logistic Regression Models

Attribute                       Logistic Regression Model 1    Logistic Regression Model 2
Log Likelihood                  -1986.4762                     -4286.5091
P-value (χ2)                    0.0000                         0.0000
Pseudo R2                       0.0999                         0.0884
Sensitivity (at 0.5 cut-off)    13.19%                         9.91%
Specificity (at 0.5 cut-off)    97.44%                         97.96%
Accuracy (at 0.5 cut-off)       78.76%                         78.91%
Area under ROC curve            0.7150                         0.7024

Both models were statistically significant on the likelihood ratio test

(p-value<0.001). The pseudo R2 for the first model is 10% compared to 9% for the

second model. The specificity of the two models is almost equal, model 1 has a slightly

higher sensitivity than model 2, the accuracy of the two models is the same (79%) and the

areas under the ROC curves differ slightly (0.7150 and 0.7024), with the first model having

minimal superiority. These statistics thus indicate that using linear discriminant analysis as an

exploratory step to logistic regression does not enhance the prediction ability of the logistic

regression model.

CHAPTER FIVE: DISCUSSION

5.1 Discussion of Results

In general, the classification ability of the logistic regression model was as good as

that of the linear discriminant analysis model. However, logistic regression identified five

variables as important predictors of stunting whilst linear discriminant analysis identified

three factors with two of them in common. The signs of the coefficients were the same for the

two models but the magnitudes of the effects differed. This study could

possibly be the first to try using logistic regression and linear discriminant analysis in

combination though the result did not improve the overall classification accuracy rate.

This study established that the prevalence of stunting in the selected subsample of the

Zvitambo cohort was 21.58%. The two methods collectively identified the following as

important predictors of stunting: gender, birth weight, birth interval, household income,

mother’s education level and gestational age. Upon comparison of their predictive accuracy,

logistic regression proved to be slightly superior to linear discriminant analysis

though the overall classification rate for both methods could be classified as acceptable (lying

between 70% and 80%).4 When the two were used in combination, that is, applying logistic regression to the

predictors identified by linear discriminant analysis as highly discriminatory, it was noted that

the predictive accuracy did not improve.

Several studies have looked at the risk factors of stunting in children in several

countries and they have identified a whole range of predictors. The finding that gender is an

important risk factor agrees with the meta-analysis carried out by Wamani et al (2007) of data

from 16 Demographic and Health Surveys from Sub-Saharan African countries of which

Zimbabwe was a part. They revealed that male children may actually be more vulnerable to

health inequalities than females of the same age group. In their study in Indonesia, Ramli et

al (2009) also identified male sex as a risk factor for stunting. Low birth weight emerged as

one of the principal risk factors for stunting, as was also noted by Ricci et al (1996) in

the Philippines and by Vitolo et al (2008) in Brazil. The same study by Vitolo et al (2008)

identified family income as a risk factor, which agrees with findings by Senbanjo et al

(2011) in Nigeria, who also established that the mother’s level of education was

associated with stunting in their children. This is possibly a result of the feeding

practices and hygienic tendencies which are the mother’s responsibility and affect the child’s

development. The risk factors of stunting identified by this research concur with findings of

other studies on the same condition. Since stunting is a health indicator of chronic

malnutrition, identification of its risk factors would assist the crafting of public health

interventions aimed at reducing its impact.

The predictive accuracy of logistic regression and that of linear discriminant analysis were

compared and the two methods had a negligible difference in the maximum overall

classification rate: 78.76% for logistic regression versus 78.86% for linear discriminant analysis.

This finding contrasts with other studies that made the same comparison, in which logistic

regression emerged with slight superiority over linear discriminant analysis. Panagiotakos

(2006) compared the two methods in predicting death or survival of in-patients admitted with

Acute Coronary Syndrome (ACS) and found out that logistic regression had a maximum

overall classification rate of 96.8% compared to 81.4% for linear discriminant analysis.

Montgomery et al (1987), Antonogeorgos G, et al (2009) and Press and Wilson (1978) noted

that logistic regression had a maximum classification rate of 82.2%, 79.2% and 80%

compared to 77.5%, 77.4%, and 68% for linear discriminant analysis, respectively. However,

when comparing areas under the ROC curves for the two statistical methods, this research

revealed that logistic regression was just slightly better in its classification ability, at 71.5%

compared to 70.04%, whilst Panagiotakos (2006) and Antonogeorgos G, et al (2009) found

no differences in the areas, 81.8% and 74.6% for logistic model versus 81.1% and 74.4% for

discriminant model respectively. However, in the comparison of the two statistical methods

it should be noted that the assumptions of multivariate normality and equal covariance

matrices within the categories for the application of linear discriminant analysis were violated

just as observed in most studies that used real research data.28,32

It was noted in this study that at the cut-off points where logistic regression and

linear discriminant analysis perform best, they both had very low sensitivity and high

specificity in agreement with the findings of Antonogeorgos et al (2009), Montgomery et al

(1987) and Panagiotakos (2006). This revealed that both models would perform excellently in

identifying children who are not stunted, implying that whenever either of the two models

classifies someone as having the condition it would be highly likely that the person indeed

has the condition.35

Linear discriminant analysis classifies subjects on the basis of their characteristics of

some independent variables which help discriminate among the groups. The variables

identified as highly discriminatory by linear discriminant analysis were used in logistic

regression to check if classification ability would improve. The resulting model was not any

better in performance (an area under the ROC curve of 70.24% compared to 71.5%), hence there was no value added in

applying the two statistical methods in combination. However, none of the studies reviewed

used the two methods in combination.

The findings of this study identified some risk factors for stunting which are

important for public health policy formulation, specifically with reference to health

promotion interventions aimed at reducing national stunting prevalence. The comparison

between logistic regression and linear discriminant analysis which resulted in negligible

differences in classification abilities would help researchers when faced with making a choice

between the two methods. In this study the two methods converged, possibly because of the

large sample size that was used, as literature states that the two tend to produce similar results

as sample size increases beyond 50.28 However, it should still be noted that the most

appropriate method to be used for analysis should always be the one whose assumptions are

not seriously violated.

5.2 Limitations of the study

The findings of this study were possibly compromised by the violation of assumptions

of multivariate normality and equal covariances within the groups which normally produces

unstable estimates for linear discriminant analysis. This violation may have been caused by

the inclusion of categorical variables, some of which had just two categories (e.g. gender),

which does not conform to the requirement that categorical variables may be used in linear

discriminant analysis only if they have many categories (5-6). The research was unable to

compare the computation time taken by the two methods, a comparison that was made in similar

studies and that could help researchers when choosing which method to use for analysis.

Another point of comparison would have been the order of selection of variables into

the model but this was not possible as linear discriminant analysis used a simultaneous

approach and not the stepwise approach that was used for logistic regression analysis. External

validity of the findings of this study was limited by the fact that the data used were

collected from urban infants and so cannot be generalized to all infants of the same age.

Using data collected by a national survey such as the Demographic and Health Survey would

help overcome this shortcoming. Despite these limitations, this study remains important in

that it validated findings of studies carried out in other countries and identified some

important risk factors for stunting and estimated its prevalence in Harare, Zimbabwe.

CHAPTER SIX: CONCLUSION

The two statistical methods, logistic regression and linear discriminant analysis,

proved to have almost the same classification ability. Convergence was evidenced by their maximum overall

classification rates, which were nearly equal (78.76% and 78.86%). At their maximum classification ability level both

methods were observed to have very high specificity and low sensitivity, implying that when

a subject was classified as positive by either of the methods it would be highly likely that they

indeed were positive. It was established that there was no value added in classification ability

when the two methods were used in combination, hence the choice to use any one of them

would have to be made by considering the assumptions for the application of each of them.

The similar signs for coefficients of the two methods indicate convergence with respect to the

direction of the effect of the factors identified. Logistic regression identified sex, birth

weight, birth interval, household income and mother’s education level as important predictors

of stunting, whilst linear discriminant analysis identified sex, birth weight and gestational

age.

This study shows researchers, especially statistical analysts, that when making a

choice of the statistical method to use it is of fundamental importance to consider the

assumptions for the application of each, as violation of assumptions may distort results. The

results of this research have also shown that linear discriminant analysis and logistic

regression tend to converge in their classification capability when the sample size is

large (>50).

REFERENCES

1. Agresti A. (2007). An Introduction to Categorical Data Analysis, 2nd Edition, John Wiley & Sons, Canada.

2. Evans J. (1998). Epidemiology in Practice: Randomized Controlled Trials; community Eye Health; 11(26): pp 26-27.

3. Freedman K.B, Back S. and Bernstein J. (2001). Sample Size and Statistical Power of Randomized Controlled Trials in Orthopaedics; The Journal of Bone and Joint Surgery (Br) 2001; 83-B(3): 397-402.

4. Hosmer D.W and Lemeshow S. (2000). Applied Logistic Regression, 2nd Edition, John Wiley and Sons, Canada.

5. http://people.stern.nyu.edu/jsimonof/classes/2301/pdf/discrim.pdf, 13/06/2014

6. Humphrey J.H et al (2006). Effects of a Single Large Dose of Vitamin A Given during the Postpartum Period to HIV-Positive Women and their Infants on Child HIV Infection, HIV-free Survival and Mortality, Journal of Infectious Diseases 2006, 193: 860-871.

7. Jiang Y. et al (2014). Prevalence and Risk Factors for Stunting among Children under 3 years in Mid-western Rural Areas of China. Child Care, Health and Development. doi:10.1111/cch.12148.

8. Krzanowski W.J (1986). Multiple Discriminant Analysis in the Presence of Mixed Continuous and Categorical Data, Computers and Maths with Applications, 1986, 12A(2): pp 179-185.

9. Kutner M.H et al. (2005). Applied Linear Statistical Models, 5th Edition, McGraw-Hill/Irwin, New York.

10. Overall J.E and Woodward J.A, (1977). Discriminant Analysis with Categorical Data, Applied Psychological Measurement 1977, 1(3): pp 371-384.

11. Paudel R. et al (2012). Risk Factors for Stunting among Children: a community-based case control study in Nepal, Kathmandu University Medical Journal 2012, 10(39):18-24.

12. Pohar M, Blas M. and Turk S. (2004). A Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study, Metodološki zvezki 2004, 1(1): pp 143-161.

13. Prendergast A.J et al. (2014). Stunting is Characterized by Chronic Inflammation in Zimbabwean Infants; PLoS ONE 9(2): e86928. doi:10.1371/journal.pone.0086928.

14. Press S.J & Wilson S. (1978). Choosing Between Logistic Regression and Discriminant Analysis. Journal of the American Statistical Association, 73, pp 699-705.

15. Ramli et al. (2009). Prevalence and Risk Factors for Stunting among Under Fives in North Maluku Province of Indonesia, BMC Pediatrics 2009;9:64.

16. Ricci J.A and Becker S. (1996). Risk Factors for Wasting and Stunting among Children in Metro Cebu, Philippines, American Journal of Clinical Nutrition, 63(6): pp 966-975.

17. Senbanjo I.O et al. (2011). Prevalence and Risk Factors for Stunting among School Children and Adolescents in Abeokuta, Southwest Nigeria, Journal of Health and Population Nutrition, 29(4): pp 364-370.

18. Stolberg H.O et al. (2004). Fundamentals of Clinical Research for Radiologists: Randomized Controlled Trials; American Journal of Roentgenology 2004; 183: 1539-1544.

19. Vitolo M.R et al (2008). Some Risk Factors Associated with Overweight, Stunting and Wasting among Children under 5 years old, Jornal de Pediatria (Rio J), 2008, 84(3): pp 251-257.

20. Wamani H. et al (2007). Boys are more Stunted than Girls in sub-Saharan Africa: a meta-analysis of 16 Demographic and Health surveys. BMC Pediatrics 2007, 7(17).

21. www.sagepub.com/upm-data/5081_Spicer_Chapter_5.pdf, 16/06/2014

22. www.who.int/ceh/indicators/0=4stunting.pdf, 16/06/2014

23. Zvitambo Study Protocol Document.

24. Shrimpton R, et al (2001). Worldwide Timing of Growth Faltering: Implications for Nutritional Interventions. Pediatrics 2001; 107: 1-7.

25. United Nations Administrative Committee on Coordination/Sub-Committee on Nutrition. Fourth Report on the World Nutrition Situation: Nutrition Throughout the Life Cycle. Geneva, Switzerland: United Nations Administrative Committee on Coordination/Sub-Committee on Nutrition; 2000.

26. Humphrey J.H, (2009). Child Undernutrition, Tropical Enteropathy, Toilets and handwashing. Lancet 2009; 374: 1032-35.

27. Wamani H, et al (2004). Mothers’ education but not fathers’ education, household assets or land ownership is the best predictor of child health inequalities in rural Uganda. International Journal for Equity in Health 2004; 3:9.

28. Antonogeorgos G, et al (2009). Logistic Regression and Linear Discriminant Analyses in Evaluating Factors Associated with Asthma Prevalence among 10-12-Years-Old Children: Divergence and Similarity of the Two Statistical Methods. International Journal of Pediatrics 2009: 952042.

29. Harrell F.E and Lee K.L. (1985). A comparison of the discrimination of discriminant analysis and logistic regression under multivariate normality. In P.K.Sen(Ed.): Biostatistics: Statistics in Biomedical, Public Health and Environmental Sciences. North Holland: Elsevier Science Publishers, 333-343.

30. Nduati R.W, John G.C, Richardson B.A, et al. Human immunodeficiency virus type 1-infected cells in breast milk: association with immunosuppression and vitamin A deficiency. Journal of Infectious Diseases 1995; 172: 1461-8.

31. Zimbabwe National Statistics Agency (ZIMSTAT) and ICF International. 2012. Zimbabwe Demographic and Health Survey 2010-11. Calverton, Maryland: ZIMSTAT and ICF International, Inc.

32. Montgomery M.E, White M.E and Martin S.W. A Comparison of Discriminant Analysis and Logistic Regression for the Prediction of Coliform Mastitis in Dairy Cows. Canadian Journal of Veterinary Research. 1987; 51(4): pp 495.

33. Panagiotakos D.B. A Comparison between Logistic Regression and Linear Discriminant Analysis for the Prediction of Categorical Health Outcomes. International Journal of Statistical Sciences. 2006; 5:pp 73-84.

34. Morrison D.G. On the Interpretation of Discriminant Analysis. Journal of Marketing Research. 1969; 6: 156-63.

35. Akobeng A. Understanding Diagnostic Tests 1: Sensitivity, Specificity and Predictive Values. Acta Paediatrica 2006; 96: pp 338-341.

Appendix A: Joint Research Ethics Committee Approval

Appendix B: Letter of Authorisation to use Data

Appendix C: Logistic Regression and Linear Discriminant Analysis

Logistic Regression

Just like multiple linear regression, logistic regression attempts to come up with a

composite function of multiple independent variables that predicts the probability π(x) of a

case being in a given category [21], in this case the probability of being stunted. In its

simplest, linear form the composite function is:

$$\pi(x) = \alpha + \beta x \tag{1}$$

Being a probability, the result of the composite function should always lie between 0 and 1,

which is not guaranteed when the coefficients are estimated by ordinary least squares as in

linear regression. That method forces linearity on a relationship that is more likely

S-shaped, and so can give values outside the 0-1 range. The solution lies in introducing

another index, the odds, defined as the probability of being in a category divided by the

probability of not being in that category. The probability itself is then the odds divided by

one plus the odds, and modelling the odds as an exponential function of α + βx yields the

following function [1]:

$$\pi(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} \tag{2}$$

Taking the natural logarithm of the odds, rather than modelling the probability directly,

recovers the linear composite function of the multiple independent variables with all its

usual properties, resulting in the logistic regression model:

$$\log\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \alpha + \beta x \tag{3}$$
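
As a minimal numerical sketch of equations (2) and (3), the Python snippet below evaluates the logistic function and checks that the log odds recover the linear predictor. The coefficient values and function names are hypothetical, chosen purely for illustration.

```python
import math

def logistic(alpha, beta, x):
    """Equation (2): probability implied by the linear predictor alpha + beta*x."""
    eta = alpha + beta * x
    return math.exp(eta) / (1.0 + math.exp(eta))

def logit(p):
    """Left-hand side of equation (3): natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

alpha, beta = -1.0, 0.5  # hypothetical coefficients, not estimates from the study
for x in (-10.0, 0.0, 2.0, 10.0):
    p = logistic(alpha, beta, x)
    # p stays strictly between 0 and 1; logit(p) gives back alpha + beta*x
    print(x, round(p, 4), round(logit(p), 4))
```

However extreme the value of x, the probability stays inside the 0-1 range, which is the property the linear form (1) lacks.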

The logistic regression model is a special case of the generalized linear model, with a

binomial random component (stunted or not stunted), a logit link function, log[π/(1 − π)], and a

linear systematic component (α + βx) which can assume any value on the real number line [1].

The estimation of parameters produces log odds that in turn give predicted probabilities,

which are used to classify cases into categories. The predicted probabilities and the actual

categories define a log likelihood function, and the desired coefficients are those that

maximize this function. Thus when different logistic regression models are compared, the best

model is taken to be that with the highest log likelihood value, as it would have the best

predictive power [4]. Logistic regression handles categorical independent variables quite

easily through the use of design (dummy) variables. With a cut-off probability of 0.5, a case

is classified into one group if the resulting probability is less than 0.5 and into the other

group if the probability is greater than 0.5 [28].
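
The maximum likelihood estimation and the 0.5 cut-off described above can be sketched for a single predictor as follows. This is an illustrative Newton-Raphson implementation using only the Python standard library, not the software procedure used in this study; the data are invented, and the handling of a probability exactly equal to 0.5 (assigned to the "not stunted" group) is an assumption, since the text leaves that case open.

```python
import math

def prob(a, b, x):
    """Predicted probability from the logistic model with linear predictor a + b*x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def log_likelihood(a, b, xs, ys):
    """Binomial log likelihood of the observed 0/1 outcomes."""
    return sum(
        y * math.log(prob(a, b, x)) + (1 - y) * math.log(1.0 - prob(a, b, x))
        for x, y in zip(xs, ys)
    )

def fit_logistic(xs, ys, iterations=25):
    """Maximise the log likelihood by Newton-Raphson, starting from a = b = 0."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = prob(a, b, x)
            w = p * (1.0 - p)
            g0 += y - p            # score with respect to the intercept
            g1 += (y - p) * x      # score with respect to the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score vector
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def classify(p, cutoff=0.5):
    """Return 1 (stunted) if p exceeds the cut-off, otherwise 0 (assumed tie rule)."""
    return 1 if p > cutoff else 0

# Invented data: one predictor and a 0/1 outcome with overlapping groups.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
a_hat, b_hat = fit_logistic(xs, ys)
predicted = [classify(prob(a_hat, b_hat, x)) for x in xs]
print(round(log_likelihood(a_hat, b_hat, xs, ys), 4), predicted)
```

The fitted coefficients give a higher log likelihood than the starting values, which is the criterion the text uses to prefer one set of coefficients over another.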

Logistic regression has relatively few rigid assumptions to be fulfilled, hence its wide

applicability in multivariable analysis. The assumptions it does make include:

• The true conditional probabilities are a logistic function of the independent variables.

• No important variables are omitted.

• No extraneous variables are included.

• The independent variables are measured without error.

• The observations are independent.

• The independent variables are not linear combinations of each other.

Thus, multicollinearity of the predictor variables may need to be checked by treating each

predictor in turn as a pseudo-dependent variable and regressing it on all the others.

Independence of the individual cases may also need to be verified using residual plots [9].
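
The multicollinearity check described above can be sketched as follows. For brevity the example is restricted to a pair of predictors, so the auxiliary regression is a simple linear regression and its R-squared is converted into a variance inflation factor, 1 / (1 − R²). The predictor names and values are hypothetical, and the often-quoted threshold of 10 is a rule of thumb rather than something stated in this dissertation.

```python
def r_squared(y, x):
    """R-squared from regressing y on a single predictor x (with intercept)."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return (sxy * sxy) / (sxx * syy)

def vif(pseudo_dependent, other):
    """Variance inflation factor of one predictor given the other."""
    return 1.0 / (1.0 - r_squared(pseudo_dependent, other))

# Hypothetical values: gestational age (weeks) and birth weight (kg).
gestational_age = [36, 37, 38, 39, 40, 41]
birth_weight = [2.5, 2.7, 3.0, 3.1, 3.4, 3.5]

# Treat birth weight as the pseudo-dependent variable.
print(round(vif(birth_weight, gestational_age), 1))
```

A large value signals that the two predictors carry nearly the same information, so their separate coefficients would be unstable.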

Linear Discriminant Analysis

Discriminant analysis performs the same task as multiple linear regression, that of

using various independent variables to determine an outcome, except that here the outcome is

categorical, with two or more categories [9]. It is used to predict group membership based

on a linear combination of the predictor variables [4]. The process can also determine which

of the predictor variables discriminate between two or more naturally occurring, mutually

exclusive and exhaustive groups that have no natural ordering [14]. This method requires that

the independent variables be normally distributed within each group and that the group

covariance matrices be equal.

The discriminant analysis procedure starts with a set of observations where both

group membership and values for the predictor variables are known. The product of the

process is a model that can be used to predict group membership when only predictor

variables are known [14]. Another purpose of discriminant function analysis is an understanding

of the data set, since examination of the resulting prediction model can highlight the

relationship between group membership and the variables used for prediction. Predictive

discriminant analysis provides a way of assigning new cases to groups. It uses the new case’s

scores on the predictor variables to predict the category to which the case belongs. Statistical

significance tests using chi-square allow one to see how well a function separates groups.

This analytical technique works by formulating a new variable, the discriminant

function score which is used to predict to which group a case belongs. The result is a

function or equation similar to a multiple linear regression equation:

$$D = a + v_1\,\text{Age} + v_2\,\text{Education} + v_3\,\text{Birthweight} + \dots + v_n\,\text{Feeding} \tag{4}$$

where D = discriminant score (stuntedness);

v_i = the ith discriminant coefficient;

X_i = respondent’s score for the ith variable (e.g. Age);

a = constant;

n = the number of predictor variables.

After determining the discriminant function, the discriminant score for each

observation is calculated and each case is classified into the appropriate group depending on

the cut-off point used for classification. This technique pivots on a statistic called the

eigenvalue, which is the ratio of the between-group to the within-group sum of squares of the

discriminant scores, such that the best discriminant coefficients are those that maximize the

eigenvalue for the composite variable [5].
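
The two-group case can be sketched with Fisher's linear discriminant, which chooses the coefficients that maximize the eigenvalue defined above. The example below is a minimal illustration with two hypothetical predictors (birth weight in kg and gestational age in weeks) and invented data; it assumes equal prior probabilities, so the constant a is set so that the cut-off score is zero. It is not the procedure or the data used in this study.

```python
def mean_vec(rows):
    """Mean of each of the two predictors within a group."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter(rows, m):
    """Within-group sums of squares and cross-products (2x2 matrix)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_lda(g0, g1):
    """Return the constant a and coefficients v of D = a + v1*X1 + v2*X2."""
    m0, m1 = mean_vec(g0), mean_vec(g1)
    s0, s1 = scatter(g0, m0), scatter(g1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    d = [m1[0] - m0[0], m1[1] - m0[1]]
    # v = (pooled within-group matrix)^-1 times the difference in group means
    v = [(sw[1][1] * d[0] - sw[0][1] * d[1]) / det,
         (sw[0][0] * d[1] - sw[1][0] * d[0]) / det]
    # Constant chosen so the midpoint between the group means scores zero
    a = -0.5 * (v[0] * (m0[0] + m1[0]) + v[1] * (m0[1] + m1[1]))
    return a, v

def score(a, v, row):
    """Discriminant score D for one case."""
    return a + v[0] * row[0] + v[1] * row[1]

def eigenvalue(a, v, g0, g1):
    """Between-group over within-group sum of squares of the scores."""
    z0 = [score(a, v, r) for r in g0]
    z1 = [score(a, v, r) for r in g1]
    grand = (sum(z0) + sum(z1)) / (len(z0) + len(z1))
    mz0, mz1 = sum(z0) / len(z0), sum(z1) / len(z1)
    between = len(z0) * (mz0 - grand) ** 2 + len(z1) * (mz1 - grand) ** 2
    within = sum((z - mz0) ** 2 for z in z0) + sum((z - mz1) ** 2 for z in z1)
    return between / within

# Invented data: (birth weight in kg, gestational age in weeks).
not_stunted = [(3.2, 39), (3.5, 40), (3.0, 38), (3.4, 40)]
stunted = [(2.4, 36), (2.8, 38), (2.5, 37), (2.9, 37)]

a, v = fisher_lda(not_stunted, stunted)
lam = eigenvalue(a, v, not_stunted, stunted)
print([round(c, 4) for c in v], round(lam, 4))
```

Under these assumptions a case with a positive score would be assigned to the stunted group and a case with a negative score to the non-stunted group.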

Page 61: DEPARTMENT OF COMMUNITY MEDICINEir.uz.ac.zw/jspui/bitstream/10646/2658/1/RUTUNGA... · National Committee on Health Statistics (NCHS)/World Health Organisation (WHO) growth reference

54

Appendix D: Zvitambo Questionnaire Used for Data Collection
