View
1
Download
0
Category
Preview:
Citation preview
UNIVERSITY OF ZIMBABWE
COLLEGE OF HEALTH SCIENCES
DEPARTMENT OF COMMUNITY MEDICINE
LOGISTIC REGRESSION AND LINEAR DISCRIMINANT ANALYSI S
IN THE EVALUATION OF FACTORS ASSOCIATED WITH
STUNTING IN CHILDREN: DIVERGENCE AND SIMILARITY OF
THE STATISTICAL METHODS.
Rutunga L. R944608E
Supervisors: Professor S. Rusakaniko
Mr V Chikwasha
A dissertation submitted in partial fulfillment of the Master of Science
Degree in Biostatistics.
ii
DECLARATION FORM
STUDENT:
I do hereby declare that this dissertation is the original work of LOYCE RUTUNGA and has
not been submitted before to the University of Zimbabwe or any other institution for the
fulfilment of any degree requirements.
Name …………………………………………………………………………………………..
Signature ............................................................................................Date ................................
SUPERVISOR:
I certify that I have supervised the writing of this dissertation and declare that it is indeed the
original work of the student in whose name it is being submitted.
Name …………………………………………………………………………………………..
Signature ............................................................................................Date ................................
DEPARTMENTAL CHAIRPERSON:
I do hereby declare all of the above statements to be true.
Name …………………………………………………………………………………………..
Signature ............................................................................................Date ................................
iii
ABSTRACT
Background: Stunting is a well-established child health indicator of chronic malnutrition
which is associated with biological, environmental and socioeconomic factors. Logistic
regression and linear discriminant analysis are two statistical methods that can be used to
predict or classify subjects as either stunted or not stunted based on all or a subset of
measured predictor variables. The predictive accuracy of the two methods were compared
with respect to several attributes of each of the methods.
Methods: Data used for the study was extracted from the Zvitambo trial data set. The
multivariable logistic regression and linear discriminant models were fitted using 20
bootstrap samples for cross validation of the coefficients. The two models were compared
with respect to the variables selected, the sign and magnitude of the coefficients, sensitivity,
specificity, overall classification rate and areas under ROC curves. The two methods were
applied in combination to check if predictive accuracy would improve.
Results: Logistic regression and linear discriminant analysis had the same predictive
accuracy with classification rates of 78.76% and 78.86% respectively. Both methods
identified two common factors, sex and birth weight, and the coefficients of the two factors
had the same negative sign but the magnitude differed significantly, both had low sensitivity
(13.19% and 8.68%) and high specificity (97.44% and 98.24%). Combining the two methods
did not improve predictive accuracy (71.5% before and 70.24% after).
Conclusion: The two multivariable techniques tend to converge in classification accuracy
mainly when the sample size is large (>50) but when faced with making a choice between the
two, it is recommended to use the method whose assumptions for application are fulfilled.
iv
ACKNOWLEDGEMENTS
I would like to express my sincere gratitude to my supervisors, Professor S. Rusakaniko and
Mr V. Chikwasha for their valuable academic guidance throughout this project and also to Mr
W Tinago and Mr G Mandozana for their occasional contribution. My acknowledgement
would be incomplete without extending my heartfelt appreciation to Zvitambo Institute for
Maternal and Child Health Research for allowing me to use their data for this research, with
special gratitude to Mr R Ntozini, Mr B Chasekwa and Dr M Mbuya who took a special
interest and commitment to see me through this project.
Last but not least, a special thank you to my fellow classmates for helping me to remain
focussed on the reason we started this academic journey and my sincerest gratitude goes to
my boys for soldiering on for so long without me.
v
Table of Contents
DECLARATION FORM ..................................................................................................................... ii
ABSTRACT .......................................................................................................................................... iii
ACKNOWLEDGEMENTS ................................................................................................................ iv
CHAPTER ONE: INTRODUCTION ................................................................................................. 1
1.0 Stunting ......................................................................................................................................... 1
1.1Multivariable Statistical Techniques .............................................................................................. 3
1.2 Description of the Original Study ........................................................................................... 4
1.2.1 Research Primary Objectives ............................................................................................... 5
1.2.2 Specific Objectives of the Zvitambo Study ........................................................................... 6
1.2.3 Subjects, Materials and Methods of the Zvitambo Study ...................................................... 7
1.2.4 Data Management ................................................................................................................ 10
1.2.5 Data Analysis ....................................................................................................................... 10
1.3 Critical Appraisal of the Study.................................................................................................... 11
1.3.1 Research Primary Objectives ............................................................................................... 11
1.3.2 Specific Objectives of the Study .......................................................................................... 12
1.3.3 Study Design ........................................................................................................................ 13
1.3.4 Sample Size .......................................................................................................................... 13
1.3.5 Sampling Methods .............................................................................................................. 14
1.3.6 Data Collection Methods ..................................................................................................... 14
1.3.7 Data Analysis ....................................................................................................................... 15
1.4 Quality of Data ............................................................................................................................ 16
1.5 Problem Statement ..................................................................................................................... 17
CHAPTER TWO: LITERATURE REVIEW .................................................................................. 18
2.1 Stunting ....................................................................................................................................... 18
2.2 Logistic Regression and Discriminant Analysis ......................................................................... 19
2.3 Research Questions ..................................................................................................................... 21
2.4 Justification Of The Study .......................................................................................................... 22
2.5 Research Objectives .............................................................................................................. 22
CHAPTER THREE: METHODOLOGY ......................................................................................... 23
3.1 Description of Data ..................................................................................................................... 23
3.2 Sample Size ................................................................................................................................. 24
3.3 Secondary Data Analysis Variables ............................................................................................ 25
3.4 Data Management ...................................................................................................................... 25
vi
3.5 Statistical Analysis ...................................................................................................................... 26
3.6 Ethical Considerations ................................................................................................................ 29
CHAPTER FOUR: RESULTS .......................................................................................................... 30
4.1 Demographic Characteristics of the Participants .................................................................. 30
4.2 Univariate Analysis ............................................................................................................... 30
4.3 Logistic Regression Model ................................................................................................... 32
4.4 Linear Discriminant Analysis Model .................................................................................... 33
4.5 Comparison of Logistic Regression and Linear Discriminant Models ................................. 34
4.6 Linear Discriminant Analysis as an Exploratory Step for Logistic Regression .................... 37
CHAPTER FIVE: DISCUSSION ...................................................................................................... 40
5.1 Discussion of Results ............................................................................................................ 40
5.2 Limitations of the study ........................................................................................................ 43
CHAPTER SIX: CONCLUSION ...................................................................................................... 44
REFERENCES .................................................................................................................................... 45
vii
LIST OF TABLES
Table 1: New Data Dictionary of Variables used in Secondary Analysis ...................................... 26
Table 2 : Demographic Characteristics of the Participants ............................................................ 31
Table 3: Results of Univariate Analysis to establish association of individual variables with stunting. ............................................................................................................................................... 32
Table 4: Results of Bootstrapped Logistic Regression ..................................................................... 33
Table 5: Results of Bootstrapped Linear Discriminant Analysis ................................................... 34
Table 6: Comparison of Logistic Regression and Linear Discriminant Analysis in terms of Sensitivity, Specificity and Classification Accuracy ........................................................................ 35
Table 7: Comparison of the Two Logistic Regression Models ........................................................ 38
LIST OF FIGURES
Figure 1: Receiver Operating Characteristics (ROC) curve for Logistic Regression model. ...... 36
Figure 2: Receiver Operating Characteristics (ROC) curve for Linear Discriminant Analysis model. ................................................................................................................................................... 37
Figure 3: Receiver Operating Characteristics (ROC) curve for the second Logistic Regression model. ................................................................................................................................................... 38
LIST OF APPENDICES
Appendix A: Joint Research Ethics Committee Approval .............................................................. 48
Appendix B: Letter of Authorisation to use Data ............................................................................ 49
Appendix C: Logistic Regression and Linear Discriminant Analysis ............................................ 50
Appendix D: Zvitambo Questionnaire Used for Data Collection ................................................... 54
1
CHAPTER ONE: INTRODUCTION
1.0 Stunting
Stunting is a well established child health chronic malnutrition indicator associated
with biological, environmental and socio-economic factors. It is defined as having a
height/length-for-age which is more than 2 standard deviations below the median of the
National Committee on Health Statistics (NCHS)/World Health Organisation (WHO) growth
reference.22 Height-for-age indicates the linear growth that has occurred at the time of taking
the anthropometric measurement, taken in an upright or standing position whilst length is
measured in a recumbent position. Stunting is therefore determined by measuring the height
or length of an infant or child as well as gender and age. Data for this purpose is therefore
readily available as the measurements are non-invasive and also cheap. In 2000, it was
established that 33% (182 million) of the world’s children were stunted and almost all of the
cases were found in developing countries, with 70% occurring in Sub-Saharan Africa and
South Asia.24,25 Prevalence of stunting in Zimbabwe according to the Zimbabwe
Demographic and Health Survey of 2012 stood at 33%.31 This research project, among other
objectives estimated the prevalence of stunting in the Zvitambo cohort which gives a rough
estimate of the prevalence of stunting in Harare urban as at the time of the trial (1997-2000).
Impaired development in children and later in adults may largely be as a result of low
birth weight. Low birth weight may be caused by intrauterine growth retardation which in
turn may be attributed to factors such as maternal under-nutrition, maternal smoking,
infection during gestation among other causes. Seeing that low birth weight has a long term
impact on children’s (and adults’) health, there is need to address the problem through
targeted interventions. Stunting is believed to develop within the first two years of life and
in most resource-limited countries there will be little or no recovery thereafter.24 It has been
2
noted that surprisingly, for stunting, the length for age global mean at birth is very close to
the NCHS standard but growth starts to deteriorate soon after birth and persists well into the
third year of life.24 It is under the assumption that children do not grow well because of lack
of the proper foods in the right quantities23 that enormous research has focused on identifying
dietary solutions for stunting.26 Assistance in the form of nutritional support and education as
well as close monitoring of the high risk infants and management of those affected are
expected to go a long way in curbing the rate of stunting in children. The measurement of the
prevalence of stunting at a later stage in life reveals the success of such interventions
delivered to low birth weight infants.22
The studies carried out in various countries were aimed at understanding the risk
factors for stunting so as to inform public policy and develop interventions that would target
the identified factors. Stunting is of public health importance because it impacts on child
mortality, a child’s cognitive development and adult economic productivity.26 Thus stunted
children tend to be slow learners in school and generally develop into underperforming adults
in life because stunting causes physical and functional deficits. Due to the public health
impact of stunting, WHO collects prevalence data at national levels which is standardized in
a systematic way to allow for inter-national comparisons and identify countries with greater
need for interventions. WHO and the Centre for Disease Control (CDC) have availed free
software packages, ANTHRO and EPIINFO for the standardized computation of Z-scores
which measure various nutritional indicators such as height-for-age, weight-for-age,
nutritional status, etc.
The data collected by the Zimbabwe Vitamin A for Mothers and Babies
(ZVITAMBO) study is ideal for the evaluation of the risk factors for stunting in children as it
follows up babies from birth to 24 months recording a wide variety variables. Anthropometric
measurements such as length and weight were taken at all follow-up visits, gender and age
3
also recorded thus enabling the calculation of Z-scores for height-for-age which would
determine whether an infant is stunted or not. Other variables like the mother’s age,
education level, occupation, housing, number of live births, birth interval, household income,
baby’s gestational age and birth weight were also recorded and constitute the potential risk
factors for stunting.
1.1Multivariable Statistical Techniques
Research, mainly medical research usually focuses on the relationship of an outcome
with multiple covariates, normally possible risk factors. Multivariable statistical techniques
are usually used for analysis. Examples of such techniques are multiple linear regression,
logistic regression, poisson regression, discriminant analysis, etc. Each technique has its
assumptions and conditions best suited for its use. Multiple linear regression is known to be a
very flexible multivariable regression technique used to analyze relationships between
multiple independent variables and a single continuous dependent variable.9 Its popularity is
based on its ability to handle all types of independent variables, namely continuous and
categorical but it however falls short when it comes to categorical dependent variables.18
Discriminant analysis and logistic regression are two widely used multivariable analytical
regression techniques for analyzing categorical outcomes.11
Logistic regression is a type of regression which is used when the dependent or
outcome variable is binary, discrete or categorical and the predictor or independent variables
are of any kind. It is particularly useful in health sciences as the dichotomous outcome is
often the presence or absence of some health condition or disease. Unlike linear regression,
logistic regression uses the logit transformation to predict group membership based on
several covariates irrespective of their underlying distribution thus it avoids predicting
negative probabilities of group membership.28,33 It is especially important when the outcome
4
has a non-linear (sigmoidal) relationship with the independent variables.33 Logistic
regression analysis is based on the calculation of the odds of an outcome, which is the ratio of
the probability of having an outcome or belonging to one group divided by the probability of
not having the outcome or not belonging to that group. Discriminant analysis is a similar
classification technique which is used to determine which set of predictor variables strongly
discriminate between two or more naturally occurring, mutually exclusive groups. It
estimates orthogonal discriminant functions, which are linear combinations of the
standardised independent covariates which yield the largest mean differences between
groups.28.33
Thus the two multivariable regression techniques are similar in many aspects but also
have distinct differences which result in logistic regression being more popularly used than
linear discriminant analysis. Both methods are useful when the outcome variable is
categorical, each technique may be used to answer questions for which the other is designed,
though estimators are calculated using different methods.4,12 The major difference between
the two techniques is that, discriminant estimators are more powerful when the covariates are
normally distributed with equal covariance, an assumption not necessary with logistic
regression. Thus in circumstances where the normality assumption is not significantly
violated discriminant analysis and logistic regression may be used to solve the same problem,
rendering a comparison of the two techniques possible based on some measures of predictive
accuracy.11 This research, therefore, seeks to compare the two methods in predicting stunting
in children using a range of predictor variables.
1.2 Description of the Original Study
Maternal-to-child transmission accounts for 90% of all paediatric HIV infections
worldwide. Among breastfeeding populations, vertical transmission rates of 20-40% have
5
been reported, of which about a third of these occur during breastfeeding. This risk is a great
dilemma for most African countries where 30% of lactating women are HIV positive and
alternative feeding choices are out of reach and would risk the numerous infant lives saved by
breastfeeding annually. Thus, to balance the two risks, an intervention which reduces the
infectiousness of breast milk of HIV positive women and could be applied universally was
needed.
Emerging data has indicated that, vitamin A deficiency in HIV positive women is
associated with higher breast milk viral load and higher vertical transmission rates. This
therefore suggests that maternal vitamin A supplementation in the immediate postpartum
period may reduce the risk of transmission during lactation. Vitamin A supplementation in
HIV negative women will also improve the vitamin A status of the mother and her breast fed
infant. In addition, the supplementation will also reduce the risk of horizontal transmission in
HIV negative women during the postpartum year when they are particularly at high risk of
getting infected. Vitamin A supplementation to the neonate would also have an additional
benefit of substantially reducing early infant mortality.
1.2.1 Research Primary Objectives
To determine if oral administration of single doses of vitamin A to mothers and
neonates during the immediate postpartum period will reduce:
1. Vertical HIV transmission during lactation by at least 30%,
2. Horizontal HIV transmission among seronegative women by at least 25%, and
3. Infant mortality by at least 30%.
6
1.2.2 Specific Objectives of the Zvitambo Study
1. To determine if oral administration of a single 400 000IU dose of vitamin A given
during the immediate post-partum period to HIV seropositive lactating mothers
will reduce HIV transmission via breastfeeding by 30%.
2. To determine if oral administration of a single 400 000IU dose of vitamin A given
during the immediate post-partum period to HIV seronegative lactating mothers
will reduce their rate of seroconversion during the post-partum year by at least
25%.
3. To determine if oral administration of a single 50 000IU dose of vitamin A given
to neonates, a single 400 000IU dose of vitamin A given to lactating mothers or
supplementation of both mother and infant during the immediate post-partum
period will reduce infant mortality by at least 30%.
1.2.2.1 Secondary Objectives:
i. To examine the association between maternal vitamin A status and viral load in
breast milk and plasma in women, and determine if vitamin A supplementation
reduces plasma and breast milk viral load and increases CD4 lymphocytes;
ii. To examine the timing of post-partum vertical HIV transmission and determine
whether maternal-neonatal vitamin A supplementation affects this timing;
iii. To investigate the relationship of acute phase reactants with serum retinol in sick
and healthy children and describe how serum retinol measures might be adjusted
by concurrent acute phase reactant measures to control for the effects of infection.
7
1.2.3 Subjects, Materials and Methods of the Zvitambo Study
1.2.3.1 Study Design
The study was a 2x2 factorial randomized, double-masked, placebo controlled trial of
14 110 mother-infant pairs randomized to one of four treatment arms:
Vitamin A Dose
Treatment Infant Mother
I(aA) 50 000IU 400 000IU
II(aP) 50 000IU Placebo
III(pA) Placebo 400 000IU
IV(pP) Placebo Placebo
Randomization was stratified by infant birth weight, thus according to
estimated distribution of all births 90% of participants were normal birth weight
babies whilst 10% were low birth weight babies. This was done in order to control
for potential confounding since low birth weight is a risk factor for adverse health
outcomes in babies.
1.2.3.2 Study Area and Target Population
The study area was greater Harare urban, Chitungwiza and Epworth. The
study was health facility-based with the participants being enrolled at any one of the
following health centres: Harare Central Hospital, Chitungwiza Hospital and twelve
City of Harare clinics. The target population was all women of child-bearing age (15-
49 age group) in Harare and Chitungwiza.
8
1.2.3.3 Study Participants
Mothers who delivered their infants at any one of the fourteen research sites
were eligible to participate if both the mother and her infant did not suffer any life-
threatening complications during delivery, if the infant weighed at least 1 500 grams
at birth, and if the mother intended to stay in Harare for at least two years after
delivery.
1.2.3.4 Sample Size
Sample size calculations that yielded 14000 participants for the main trial and
primary objective of vertical HIV transmission were done based on the following
assumptions:
1. The prevalence of HIV infection among women enrolled in the study was 30% at
baseline.
2. The mother-to-child transmission rate among HIV+ mothers in the control group
was 30% by 24 months and 10% during breast feeding.
3. The seroconversion rate among the HIV- mothers in the control group during the
first year post partum would be 6%.
4. The infant mortality rate in the treatment arm in which both mother and infant
received placebo would be 60 per 1000.
5. The use of two-tailed tests with an overall type I error of 5% and a type II error of
20%.
6. A reduction in total post partum HIV vertical transmission of 20%.
7. A retention rate of 90% of all participants.
Sample sizes for the other two primary objectives and the sub-studies were calculated
based on the main trial sample size and the relevant prevalence and anticipated
reduction rates.
9
1.2.3.5 Sampling Methods
The 14 110 participants were recruited on Mondays to Fridays over a period of
18 months and were distributed proportionately between Harare Hospital(60%) and
the City Health Service Clinics(40%) according to the usual births distribution in the
city. At all the fourteen study sites convenience sampling was used. Thus all mothers
delivering at each of the participating health centres during the 18-month recruitment
period were considered for eligibility. If the woman did not meet the inclusion
criteria, or refused to participate then the next mother on the delivery register was
considered.
1.2.3.6 Data Collection Methods
A team of specially trained nurses at each participating site were responsible
for the recruitment process. Following a written consent from eligible women, a
standardized questionnaire and transcription of hospital records were used to obtain
baseline information on the mother and infant. Vitamin A supplementation was then
administered according to the randomization plan. A follow-up questionnaire was
used to record information such as history of maternal and infant illness, feeding
practices, anthropometric measurements, etc at each follow-up visit. Specialized
questionnaires were also used to gather information on adverse health events such as
hospitalization.
The consenting mothers and their infants would also undergo physical
examinations by trained nurses which included taking various measurements such as
infant weight and length. Blood samples were obtained by venipuncture from the
mothers and by heel prick from the infant. Colostrum at baseline and breast milk on
follow-up visits were collected by manual expression. The blood and milk specimens
10
were subjected to an array of tests including HIV and milk retinol tests in specialized
laboratories by trained personnel.
1.2.3.7 Outcome Measures
The measurable outcomes for the study included HIV status for both mother
and infant taken at all visits to establish interval of infection. Breast milk specimens
were taken to establish vitamin A status and HIV viral load. Vital status of the
mother and the infant was established in order to determine mortality rates in the
cohort. Morbidity history of the pair was also recorded.
1.2.4 Data Management
At enrolment the participating mother and infant pair were allocated an identity code
linked to the capsule packet given to them and at each follow-up visit all questionnaires and
specimens were further coded by a two-digit visit number and a one letter code for specimen
type. Data collected at all research sites was checked for legibility, completeness and
accuracy by an appointed nurse supervisor before being taken to the Data Entry Shop at
Harare Hospital. All study instruments were double-entered by two data entry clerks using
SPSS-DE for Windows and any discrepancies were resolved by referring to the original hard
copies.
1.2.5 Data Analysis
Descriptive analyses of demographic and clinical characteristics of the mothers and
infants were done for the entire study population and by treatment arm to check success of
randomization indicated by random distribution of these characteristics across all arms.
11
Continuous and categorical baseline characteristics were compared across the 4 treatment
groups using Kruskal-Wallis and χ2 tests, respectively.
The percentage efficacies of maternal and infant vitamin A supplementation in
reducing postpartum vertical transmission, horizontal transmission and infant mortality were
calculated using the Turnbull method. Confidence intervals (95%) were computed for the
efficacy using 2000 bootstrap samples. Exclusive breastfeeding rates were estimated using
Kaplain-Meier methods and compared across the treatment arms by pairwise log-rank tests.
T-test and linear regression models were also used to investigate the effect of vitamin A
supplementation on the quantity of HIV in breast milk and plasma, on CD4 counts, on the
infant vitamin A status during the first year of life and on the association between acute phase
reactants and serum retinol concentrations.
Logistic regression was used to examine the effect of neonatal and/or maternal
vitamin A supplementation on vertical transmission, horizontal transmission and infant
mortality controlling for selected variables such as maternal age, socioeconomic status, serum
retinol concentration at birth, infant birth weight, nutritional status, among others. Kaplain-
Meirer survival curves and Cox proportional hazards regression models were used to
compare the timing of infection of the infant and the timing of death in the mother-infant
pairs who received vitamin A supplementation and those who did not.
1.3 Critical Appraisal of the Study
1.3.1 Research Primary Objectives
According to the WHO peri-natal transmission study carried out in Harare, mother-to-
child vertical transmission of HIV account for over 90% of all pediatric HIV infections
nationwide and between a quarter to a third of these occur during breastfeeding. Vitamin A
deficiency among HIV seropositive women has been shown to be associated with higher
12
breast milk viral load and hence higher vertical transmission rates,6,30 suggesting that
maternal Vitamin A supplementation during the immediate postpartum period may reduce the
risk of transmission during breast feeding. This indicates that the primary objectives of the
study were targeting a real problem in the prevention of mother-to-child HIV transmission.
1.3.2 Specific Objectives of the Study
1. The objectives of the study were specific as they clearly stated what was to be done
and for whom the intervention was intended. Mothers and their new born babies
received a specific dose of Vitamin A during the immediate post partum period and
the expected effect was specified.
2. The measurability of the objectives were illustrated in the fact that the anticipated
results were clearly quantified in comparison with the known baseline status. Vertical
HIV transmission and infant mortality reduction of 30% and a 25% reduction in
horizontal transmission in the mothers was expected.
3. The planning stages of the research involved wide reading in the area of interest so as
to learn from the experiences of others. The budget and expert human resources
required for the study were sourced. Thus achievable objectives were set in line with
the secured financial and human resources as well as the predetermined time frame.
4. The objectives of the study were quite relevant considering that at the time the
research was done (1997-2000), strategies to curb the HIV/AIDS pandemic remains
of paramount importance to the health of the nation. Thus the objectives were
relevant to the main goal of HIV transmission.
5. The objectives were referring to the postpartum period, with the outcomes of
horizontal transmission, infant mortality and vertical transmission being measured at
one year and two years respectively, thus making the objectives time-bound. The
13
study participants were followed up for a specific time period, after which the
outcome of interest was measured.
1.3.3 Study Design
The study design used was a 2x2 factorial randomized, double-masked, placebo
controlled trial. Randomized controlled trials are best suited for investigating the effect of
intervention procedures on outcomes such as death or occurrence of disease.17 That makes it
the most appropriate design for studying the effect of a single dose of Vitamin A
supplementation to mothers and their neonates on HIV vertical and horizontal transmission as
well as infant mortality.
Randomized controlled trials are considered to be the ‘gold standard’ of evidence-
based medicine because they are the only known method that significantly minimizes
selection and confounding biases2 and help infer causality by establishing temporal sequence
between exposure and outcome. In this case, bias was further reduced by double-masking,
placebo control and stratifying by birth weight which is a known confounder, thus further
increasing confidence in the study findings. Since the study was investigating two treatments,
that is maternal and neonatal supplementation, the 2x2 factorial design was ideal as it further
allowed for the evaluation of the interaction that may exist between the two treatments17.
1.3.4 Sample Size
The sample size was sufficient for the anticipated effects especially considering the
calculations were done taking into account all the necessary assumptions and the expected
reduction rates. Longitudinal studies have the risk of loss-to-follow up and the sample size
calculations took into account a retention rate of 90% such that the study would retain its
power.
14
1.3.5 Sampling Methods
Participants for the study were enrolled at Harare and Chitungwiza Hospitals and all
city health service clinics. For the results of the study to be generalizable to the study
population, the participants had to be truly representative of the target population, thus
sampling methods had to be free of selection bias. Convenience sampling was used at all the
recruitment health facilities mainly due to the large numbers that were required for the trial so
every woman delivering at any of the study sites was considered for participation if they met
the inclusion criteria. This type of sampling is inexpensive, easy, fast and subjects are readily
available. However, convenience sampling introduces sampling bias into the study since the
resulting sample is not truly representative of the target population.
The sample for the Zvitambo study was selected only in Harare urban thus excluding
the rural population which has some inherent characteristics that are different from the urban
population and would most likely vary the findings. In addition, the study was health centre
based thus biasing the results to only those mothers with a high health-seeking behavior.
These two factors and the sampling technique used therefore limit the external validity of the
study findings.
1.3.6 Data Collection Methods
Interviewer-administered standardized questionnaires and chart reviews were
appropriately used to collect maternal and infant characteristics. These standardized tools
were appropriate as they managed to collect the same information from all participants in an
almost uniform manner minimizing interviewer bias. The questionnaires also collected
participant locater details such that in case of missed clinic appointments home visits could
be made.
15
The research objectives involved determining HIV status of both mother and infant,
their vitamin A status and breast milk viral load at regular time points hence the collection of
blood and milk specimens was very appropriate. The use of specially trained nurses and
laboratory personnel in data collection ensured a high quality of specimen collection and
diagnostic testing on which the credibility of the study findings were hinged.
1.3.7 Data Analysis
An intent-to-treat analysis was carried out as all (14110)mother-infant pairs who were
randomized were included in the analysis. This is the classic analytic approach for any
experimental study as it measures the effectiveness of the intervention under everyday
practice conditions, hence was the most ideal for this study. Just as required for any
randomized control trial, the first step was to carry out descriptive analyses of demographic
and clinical characteristics of the entire population and by treatment arm to examine success
of the randomization process and this was appropriately done. Comparison of these
characteristics across treatment arms was done using Kruskal-Wallis and χ2 tests for
continuous and categorical variables, respectively and these were relevant as it was necessary
to compare the distribution of all other characteristics across treatment arms.
The percentage efficacies of maternal and infant Vitamin A supplementation in HIV
transmission and mortality was calculated using the Turnbull method which was appropriate
for the type of data whereby the exact time of censoring was not known but instead only the
interval within which the event occurred was known. To establish the effect of Vitamin A
supplementation on HIV transmission and infant mortality, logistic regression techniques
were used to control for other potential predictors. This technique was used appropriately
considering that the predictor variables were a combination of categorical and continuous
variables, some of which were not necessarily normally distributed.
16
Survival analysis using the Kaplain-Meier methods and Cox proportional hazards
regression was done. This type of analysis was appropriate since it was necessary to compare
survival in the different treatment arms in terms of HIV infection as well as infant death. It
was of interest to determine the probability of remaining infection-free and the probability of
the infant surviving beyond a specific time period with respect to the treatment arm.
Comparisons of continuous variables such as milk, serum viral load and vitamin A status
across study arms were done using t-test, ANOVA and linear regression. These statistics
were relevant considering that the variables under consideration were continuous and largely
normally distributed with possible calculation of means across the study arms.
1.4 Quality of Data
The integrity of any study is hinged on the quality of the data collected as this
represents the actual study findings. The quality of research data refers to its state of
completeness, relevancy, internal and external validity, consistency, timeliness and accuracy
which makes the data appropriate for a specific use. The critical appraisal of the Zvitambo
study highlighted the measures that were taken during data collection and handling which
ensured that the resulting data was of very high quality. The data was collected using
standardised questionnaires which were administered by trained personnel, double data entry
was implemented with physical verification used in cases of any discrepancies. The data was
edited by way of manipulating some variables, cleaned, verified and validated thus enhancing
its quality. The researcher hence has a very high level of confidence in the quality of the data
used in this research and this being enhanced by the fact that even more than a decade after
the study was completed the data was being used to answer different research questions
including its use in international meta-analysis.
17
1.5 Problem Statement
Discriminant analysis answers the question “What is the probability of correctly
classifying an observation”, whilst logistic regression answers “What is the probability of
success given a set of covariates”4. The two questions are so similar such that either
technique can be used effectively to answer questions for which the other is designed. That
being the case, it may be worthwhile to study how the predictive accuracy of the two
techniques compare. On the other hand, discriminant function estimators may be used as an
exploratory stage in the process of fitting a logistic regression model.14 In this case it is of
interest to ascertain whether combining the two techniques improves the predictive power of
the resulting logistic regression model.
18
CHAPTER TWO: LITERATURE REVIEW
2.1 Stunting
Wamani et al(2007) carried out a meta-analysis of 16 demographic and health surveys
of 10 sub-Saharan countries and noted that generally boys tend to be more stunted than girls
in the same age groups of under five years.20 They calculated a pooled estimate of the mean
z-scores which was statistically different for boys and girls and the prevalence of stunting
was also higher among boys than among girls. One of the studies by the same author
revealed that the differential in stunting rates between boys and girls was more pronounced in
the lower socio-economic groups than in the well- to-do groups.27
Studies carried out in several countries to investigate the risk factors for stunting in
children identified a variety of biological, socioeconomic, behavioural and environmental risk
factors. Some of the studies established that different risk factors are at play at different
stages of a child’s development. Studies carried out in the Phillipines and Indonesia indicate
that the principal risk factors for stunting below six months of age are maternal behaviours
and child biological characteristics e.g breast feeding status, sex and birth weight whilst after
six months, household socioeconomic status, behavioural and biological characteristics
become important e.g father’s education or occupation, age and sex.15,16 Studies done in
Brazil, China, India and Nigeria also noted that sanitation in the area, mother’s age, birth
interval, family size and attendance of public schools were among the risk factors for
stunting.11,13,17,19 Interventions targeted at women and children with special needs were
proposed such as increasing women’s access to education and prenatal care, encouraging
exclusive breastfeeding and family planning and interventions targeted at low birth weight
babies.16
19
All the studies reviewed used a cross-sectional study design whereby the various
anthropometric measurements were taken and questionnaires were used to collect other
historical information, behaviours and socioeconomic details. Multi-stage stratified sampling
was employed in the selection of participants15,16 mainly to ensure that risk factors were
examined within the same age groups and environmental setting (rural/urban). In the data
analysis, univariate analysis was first done to establish associations between stunting and
each of the risk factors. Logistic regression in all cases was the multivariable technique used
to check which predictor variables were independently associated with the outcome adjusted
for the effect of the other risk factors.
2.2 Logistic Regression and Discriminant Analysis
All the studies reviewed on stunting, used logistic regression as the multivariate
technique to determine the major risk factors for stunted growth in children. This is possibly
due to the robustness and the fact that the risk factors being investigated are a mixture of
categorical and continuous variables. An additional factor that enhances the utility of logistic
regression when compared with discriminant analysis is the assumption of normality and
equal variances of the predictor variables which is hardly satisfied in practice.14 However,
when the normality assumption is valid the discriminant analysis estimator is more efficient
than the logistic regression estimator5,14. Contrary to popular belief that logistic regression is
limited to two categories for the dependent variable Hosmer and Lemeshow (1989) have
proven that it can be applied to situations with more than two categories. On the other hand
discriminant analysis is believed to be applicable only when independent variables are
interval scaled but research has also revealed that it can be used with both continuous and
categorical data.10,18These developments set a basis upon which the two multivariable
techniques can be compared in answering the same question using the same data set and
establish if their predictive accuracy is comparable.
20
Press and Wilson (1978) investigated what has to be considered when making a
choice between using logistic regression and discriminant analysis after noting that the two
methods may be used to answer the same research question. They first outlined the
theoretical differences between the two methods which include the underlying assumptions
and the distinct methods of estimating coefficients and then apply the two methods to some
empirical data. The resulting models were then compared and performance of each technique
was determined by a classification rate, in which case logistic regression outperformed
discriminant analysis. This was alluded to the fact that whenever the assumption of normality
of covariates is violated, discriminant analysis performs poorly.14
Pohar et al (2004) compared logistic regression and discriminant analysis using
simulated data and in addition to the classification error rate criteria they also used some
indexes adopted from Harrell and Lee (1985).12 The indexes simply called A, B, C and Q
were noted to be statistically more efficient than the classification error rate as they reveal
how well each model discriminates between the groups and how good the prediction is.29
Antonogeorgos G, et al (2009) carried out logistic regression and discriminant analyses to
evaluate factors associated with prevalence of asthma among children whereby they intended
to evaluate the divergence and similarity of the two statistical techniques. The study used
cross-sectional anthropometric and lifestyle data from 10-12 year old children in Greece and
related them to the presence of asthmatic symptoms. Logistic regression and discriminant
analyses produced similar models upon comparison of sign and magnitude of coefficients, the
area under the Response Operating Characteristic (ROC) curves which indicates
classification accuracy by plotting sensitivity against specificity of the model .28
Montgomery et al (1987) compared the two techniques using two data sets on the
predictor variables of coliform mastitis on dairy cows. Their study revealed that the methods
identified the same variables as important predictors and were equally useful in classifying
21
cows as diseased or not diseased though logistic regression had fewer classification errors.
Comparisons were done with respect to the variables selected, order of selection, sign and
magnitude of the variable coefficients, specificity, sensitivity and overall correct
classification rate investigated at varying probability cutoff points. Response Operating
Characteristic (ROC) curves for the two models were compared on the same axes and
revealed that logistic regression had a better classification ability than discriminant analysis ,
that is, it correctly classifies more cases than does discriminant analysis.32 Panagiotakos
(2006) set out to compare logistic regression and discriminant analysis in the prediction of in-
hospital mortality of patients admitted with Acute Coronary Syndrome. Like Montgomery et
al (1987), he compared the selected variables, magnitude and sign of the coefficients,
specificity, sensitivity, classification rate and ROC curves. For statistical generalization, a
non-parametric bootstrap technique was used to estimate both logistic regression and
discriminant analysis estimates. He concluded that the two methods resulted in the same
model but logistic regression gave a better classification rate.33
The studies reviewed indicate that in the majority of situations logistic regression and
discriminant analysis produce models that converge, provided all statistical assumptions for
the two techniques are satisfied. Logistic regression however, becomes preferable given the
usual failure to meet the assumptions of equal covariance and multivariate normality as it
tends to produce more stable estimates and can handle data on any measurement scale.
2.3 Research Questions
1. What are the prevalence and risk factors for stunting in the infants enrolled for the
Zvitambo study?
2. What is the relative predictive accuracy of discriminant analysis and logistic regression
(which analytical method is more reliable in classification of subjects into categories)?
22
3. Can discriminant analysis be used as an exploratory stage in the fitting of a logistic
regression model in order to enhance predictive power?
2.4 Justification Of The Study
From a review of the literature, discriminant analysis and logistic regression can
apparently be used to answer the same research question although their solutions may be
fundamentally different14. Against this backdrop, it is cause for concern that research hardly
employs discriminant analysis as an analytic technique, and yet it may be a reliable technique
with low classification error when required to predict membership to a given category using a
set of explanatory variables. The proposed analysis aims to establish whether the two
techniques have the same predictive accuracy and therefore utility or whether they are
complementary to the effect that their use in combination rather than individually improves
predictive power. Some studies that compared the two statistical techniques used simulated
data sets, so it would be worthwhile to compare the two using real research data.
2.5 Research Objectives
1. To determine the prevalence and risk factors for stunting in the Zvitambo cohort.
2. To compare the predictive accuracy of logistic regression and discriminant analysis in
the prediction of stunting in children.
3. To determine whether discriminant analysis can be used as a preliminary exploratory
step to logistic regression to improve prediction power.
23
CHAPTER THREE: METHODOLOGY
3.1 Description of Data
The data that was used in this research was collected by the ZVITAMBO study –
“Vitamin A supplementation of breast feeding mothers and their neonates at delivery: Impact
on mother-to-child HIV transmission during lactation, HIV infection among women during
the post-partum year, and infant mortality.” This was a two year follow-up study carried out
in Harare with enrolment of mothers and their infants taking place within 96 hours of
delivery and follow-up visits at 6 weeks and at 3 monthly intervals thereafter, with the first
recruitment done in 1997 and the last follow-up in 2000. Maternal and infant characteristics,
blood and breast milk specimens were collected at baseline and follow-up visits at Harare and
Chitungwiza hospitals and at twelve city health service clinics. Maternal variables measured
included among others, age, level of education, nutritional status, employment status,
monthly income, birth interval, knowledge of feeding practices, morbidity history and their
partner’s level of education and employment status. Infant characteristics that were collected
included gestational age, delivery method, sex, length, weight, feeding practices and
morbidity history. These variables constitute the covariates that were used in modeling
stunting in the infants using logistic regression and linear discriminant analysis.
The data that was used is longitudinal in nature with observations being recorded at 3-
monthly regular intervals, thus the outcome of interest, stunting may be measured at different
stages of the infant’s development. For the purposes of this secondary data analysis, stunting
was considered at 12 months. Thus a cross sectional analysis at that particular time interval
was done as literature has revealed that different factors may be associated with stunting at
the different stages of child development.15
24
3.2 Sample Size
The original study required a minimum sample size of 14000 but enrolled a total of
14110 participants thus ensuring that the study maintained its anticipated power regardless of
possible missingness. The sample size used for this analysis was 9555, which was the total
number of observations in the Zvitambo data set less those participants who had no
anthropometric measurements at 12 months and any other missing variables at baseline. This
sample size was deemed adequate for the purposes of this research based on three
components, namely required sample size for applying the two analytical techniques, logistic
regression and discriminant analysis as well as the cross-sectional design adopted in handling
the application data.
Using the Dobson’s formula for the calculation of sample size for a cross sectional
study design, a minimum sample size of 340 was calculated for a precision of 0.05, 95%
confidence level(z = 1.96) and stunting prevalence of 33%(from the 2010-11 ZDHS Report),
that is: n =
=
= 339.7511
Discriminant analysis requires that the minimum sample size be at least five times the
number of categories with at least 20 cases per category28, thus giving a minimum sample
size of 40 since there were two categories. Logistic regression is not very sensitive to sample
size such that a sample sufficient for the study design employed would be adequate for the
use of the technique. Therefore, a sample of 9555 was adequate for the purposes of this
project, as it was well above 340, the highest of the minimum sample sizes required by any of
the three criteria. One of the objectives of the secondary data analysis was to estimate the
25
prevalence of stunting in the Zvitambo cohort, hence the use of 9555 observations which
constituted all the infants in the study who had anthropometric measurements at 12 months.
3.3 Secondary Data Analysis Variables
The Zvitambo study collected a wide range of variables, up to two hundred in total
(recruitment and follow up variables), some of which were not relevant to this research
analysis hence the relevant variables had to be identified. The identification process was
aided by a review of related literature on stunting which gave an indication of potential risk
factors. The outcome measure for this study is stunting but this was only determined after
some calculations using infant length, age and sex. The risk factors extracted from the
original data were mode of delivery, sex of infant, birth weight, gestational age, maternal age,
maternal and partner’s occupation, maternal housing, maternal and partner’s years of formal
schooling, family income, birth interval, number of live births, infant morbidity history and
breast feeding status. A new variable dictionary was produced for this research as illustrated
in Table 1.
3.4 Data Management
The variables identified were used to identify the appropriate tables in the database.
Dbase Plus 8 was used in the Zvitambo study but this research used Stata 12, hence the data
used in the analysis was imported from Dbase to Stata12. Some of the variables were not in
the format suitable for analysis so the data had to be cleaned. This involved dropping some
participants who had some missing variables in their data, recoding some variables,
generating new variables by manipulating existing ones, merging several data sets. The main
outcome variable, stunting was not available explicitly in the data so it had to be created by
using the WHO Anthropometric Calculator which requires the length/height, age and gender
of the infant to calculate the height-for-age Z-scores. Classification as either stunted or not
26
stunted was done by considering whether the Z-score was below or above 2 standard
deviations respectively. The original data sets were maintained so as to provide back-up
copies.
3.5 Statistical Analysis
Descriptive analysis of the data was done to give a general description of the
demographic characteristics of the study participants. Univariate analysis was carried out in
order to establish which independent variables were individually associated with stunting.
Chi square test for independence was used for categorical variables whilst the Kruskal Wallis
test was used for the continuous ones. The two statistical methods, logistic regression and
discriminant analysis were then applied to compare their classification and predictive
abilities.
In order to validate the variables that are really important in the model and to allow
for statistical generalization a non-parametric bootstrap estimation procedure32 with 20
samples from the main data set was used for both logistic regression and discriminant
analysis. Stepwise logistic regression and canonical discriminant function analysis were
performed on the randomly selected part (50%) of the data set using STATA 12. The
continuous sampling and estimation allows for validation of the models. For logistic
regression the significance levels for entry and removal were set at 0.20 and 0.30 respectively
and the variables retained in more than 60% of the models were noted as important risk
factors. For discriminant analysis, variables with a correlation greater or equal to ±0.3 in the
canonical structure were retained as important covariates in predicting stunting. The means
of the coefficients obtained from the bootstrap samples were taken as the best estimates for
the two models. For the logistic regression parameter estimates, the corresponding bootstrap
confidence intervals for the means were also calculated.
27
Table 1: New Data Dictionary of Variables used in Secondary Analysis Variable Name
Variable Definition
Variable Type
Coding
PTID Participant ID
Nominal
a05 Mode of delivery Nominal 1=Normal vaginal 2=Breech vaginal 3=Emergency C-section 4=Elective C-section 5=Forceps 6=Vacuum
a14 Gender Nominal 0=Male 1=Female
a16 Birthweight(grams) Continuous
ga_clc Gestational age(days)
Continuous
mom_age Mother’s age(years)
Continuous
edu_mom Mother’s years of education
Continuous
a20 Maternal occupation
Nominal 1= Domestic/Unskilled worker
2=Skilled Manual
3=Clerical
4=Professional
5=Vendor
6=Unemployed(housewife)
7=Other
edu_partner Partner’s years of education
Continuous
a23 Partner’s occupation
Nominal 1= Domestic/Unskilled worker 2=Skilled Manual 3=Clerical 4=Professional 5=Vendor 6=Unemployed(housewife) 7=Other
a25 Maternal housing Nominal 1=Own 2=Rented 3=Lodge 4=Extended family 5=Employer provided 6=Other
a26_std Household income(Z$)
Continuous
a44 Number of live births
Continuous
birth_int Birth interval Continuous
28
Several diagnostic tests were performed on the logistic regression model and the
discriminant analysis model to check for validity of assumptions. Stepwise logistic
regression has an in-built mechanism to test for multicollinearity and drops any variables that
would distort the validity of the model. Overall fit of the logistic regression model was tested
using the likelihood ratio test, p-value < 0.05 representing statistical significance. Overall
significance of the discriminant function was tested using the Wilk’s Lambda test,
multivariate normality test was used to check for normality of the covariates and the
multivariate covariance test was used to test for equal covariance matrices.
The resulting models were used for classification, varying the cut-off points or prior
probabilities and noting how the sensitivity, specificity and overall classification rate(total
correct classification percentage) varies at each cut-off point. After investigating the models’
predictive abilities at different cut-off points, the summary statistics were compiled and
compared. The comparisons were based on the following aspects; the variables selected in
the models, the sign and magnitude of the coefficients, sensitivity and specificity of the
classifications, the overall classification rate at varying cut-off probabilities.32 The Response
Operating Characteristics(ROC) curves were also compared to determine which of the two
models enclosed a larger area indicating a better classification ability. The ROC curve plots
sensitivity(rate of true positives) and 100 minus specificity(rate of true negatives) at several
cut-off points and so provides a quick graphical assessment of the effect of varying the cut-
off point in any classification model.
To check whether the two methods can be used in combination to improve
classification accuracy, discriminant analysis was applied as an exploratory step to identify
those variables that strongly discriminate stunted children from their non-stunted
counterparts. These covariates were then subjected to logistic regression and the resulting
29
model was used in the prediction of stunting in the validation set and predictive accuracy was
compared to the results of the first logistic regression model.
3.6 Ethical Considerations
Approval to carry out the research was sought and granted by the Joint Research
Ethics Committee, approval letter in Appendix A. Permission to use the data was granted by
Zvitambo Institute of Maternal and Child Health, letter of authorization in Appendix B. The
Zvitambo trial was approved by the Medical Research Council of Zimbabwe (MRCZ).
30
CHAPTER FOUR: RESULTS
4.1 Demographic Characteristics of the Participants
Of the 9555 participants included in this study, 51% were males and the overall
prevalence of stunting in this Zvitambo cohort at 12 months of age was 22%, thus lower than
the national prevalence of 33%. Of the stunted infants, 62% were males indicating males tend
to be more stunted than females. Most of the babies were full term with a mean gestational
age of 275 days and a standard deviation of 10 days, had normal birth weights 2992 grammes
and a standard deviation of 449 grammes and most had a normal vaginal delivery (89%).
Almost all the babies (98%) were breast fed for some time within the two years follow-up. It
is also evident that the majority of the infants were born into low socioeconomic households
with unemployed mothers (81%), partners who did skilled or unskilled manual jobs (66%),
had an average monthly income of US$84.23 and many stayed in lodged accommodation
(59%). The majority of mothers and their partners were literate and had completed an
average of seven years of formal education.
4.2 Univariate Analysis
Table 3 shows results of the univariate analysis performed to investigate whether the
predictor variables have any individual association with the outcome, stunting. The following
factors were found to be significantly related to stunting (p-value < 0.05): sex, birth weight,
gestational age, mother and partner’s education, partner’s occupation and having fever.
31
Table 2 : Demographic Characteristics of the Participants
Characteristic Summary Statistic Categorical Variables Frequency N (%) Infant Gender Male
Female 4909 (51.38) 4646 (48.62)
Delivery mode Normal vaginal Breech vaginal Emergency C-section Elective C-section Forceps Vacuum
8427 (88.93) 122 (1.29) 597 (6.30) 210 (2.22) 2 (0.02) 118 (1.25)
Breastfeeding Ever breast-fed Never breast-fed
9365 (98.05) 186 (1.95)
Maternal Housing Own Rented Lodge Extended family Employer- provided Other
1017 (10.65 ) 141 (1.48) 5632 (59.00 ) 2156 (22.59) 541 (5.67) 58 (0.61)
Maternal Occupation Domestic/Unskilled Skilled manual Clerical Professional Vendor Unemployed Other
502 (5.26) 427 (4.48) 148 (1.55) 109 (1.14) 273 (2.86) 7864 (82.48) 212 (2.22)
Partner Occupation Domestic/Unskilled Skilled manual Clerical Professional Vendor Unemployed Other Don’t know
2526 (26.50) 3806 (39.93) 624(6.55) 875(9.18) 708 (7.43) 285 (2.99) 577 (6.05) 130 (1.36)
*Stunting-12 months Stunted Not stunted
1970 (21.58) 7160 (78.42)
Continous normal Mean ±SD Birth weight (grammes) 2992.386 ± 449.3237
Gestational Age (days) 275.0207 ± 9.8941
Mom’s education(yrs) (years) 9.9130 ± 2.1823
Partner’ education(yrs) (years) 11.1276 ± 2.0930
Continous non-normal Median (Q1; Q3) Birth interval (years) 3.9055 (2.7652 ; 5.5305) Mom’s age (years) 23.6468 (20.5969 ; 27.6222) Number of Live births 2 (1 ; 3) Family Income (USD) 84.23 (54.62 ; 141.54)
32
Table 3: Results of Univariate Analysis to establish association of individual variables with stunting. Variable Stunted
(stats) Non-stunted (stats)
p-value
Sex+ Males Females
1225 745
3476 3684
<0.001**
Delivery Mode+ Vaginal Non-vaginal
1762 208
6407 753
0.957
Mom’s Occupation+ Skilled Unskilled Unemployed
122 168 1633
534 570 5879
0.124
Partner’s Occupation+
Skilled Unskilled Unemployed
1006 744 65
4037 2361 212
<0.001**
Maternal Housing+ Own/Rented Other
324 1646
1281 5879
0.136
Breastfeeding+ Ever breastfed Never breastfed
1925 44
7026 132
0.264
Birth Weight* Grammes(mean) 2775.23 3055.94 <0.001**
Birth Interval*
Years(median) 3.72 3.95 0.2117
Gestational Age* Days(mean) 272.52 275.72 <0.001**
Mom’s Age* Years(median) 23.54 23.69 0.1624
Mom’s Education* Years(mean) 9.59 9.99 <0.001**
Partner’s Education* Years(mean) 10.88 11.19 <0.001**
Family Income* USD(median) 74.00 85.96 0.0579
Number of Live Births*
(median) 2 2 0.0905
Note: +χ2,*Kruskal-Wallis, **significant at p<0.05
4.3 Logistic Regression Model
Table 4 shows the variables that were identified by logistic regression analysis. After
carrying out the bootstrap estimation, those variables that were retained in more than 60% of
the models were considered to be important in the prediction of stunting in children. These
factors were gender, birth weight, household income, birth interval and the mother’s level of
33
education. From the signs of the coefficients of the variables, they all indicate a protective
effect against stunting. The 95% confidence intervals do not cross zero, thus indicating that
all these retained factors are significant, and they are relatively narrow implying very precise
estimates.
Table 4: Results of Bootstrapped Logistic Regression Variable Coefficient 95% CI
Gender -0.83439
-0.8733821; -0.7953979
Birth Weight -0.001585 -0.0016403; -0.0015297
Household Income -0.0001021 -0.0001153; -0.0000889
Birth Interval -0.0493 -0.0622183; -0.0363817
Mother’s Education -0.0537154 -0.065569; -0.0418618
Constant 5.7025 5.0328; 6.3721
The likelihood ratio test (p-value < 0.001) indicates overall statistical significance of the
model but however, the pseudo R2 value of 0.1 shows that the model only accounts for 10%
of the variability in stunting.
4.4 Linear Discriminant Analysis Model
Table 5 shows the variables that linear discriminant analysis identified as important
risk factors for stunting. A variable is considered as an important variable in the
discriminatory model if its correlation with the linear discriminant function (canonical
structure) is greater or equal to 0.3 in either direction. The three factors that were identified
as having significant discriminatory power were gender, birth weight and gestation age. All
three factors were protective against stunting as they all have negative coefficients. The
linear discriminant function was statistically significant (p-value<0.001 for the F statistic
34
derived from the Mahalanobis distance) thus the two categories of stunting were statistically
different. However, both the multivariable test for normality and for equal covariances were
significant (p-value < 0.001) implying that the two assumptions were violated. Normality
transformations were applied to the covariates but yielded no significant change.
Table 5: Results of Bootstrapped Linear Discriminant Analysis Variable Standardised Coefficient Canonical Structure
Gender -0.50229
-0.33464
Birth Weight -0.86139
-0.80483
Gestation Age -0.08688
-0.38873
4.5 Comparison of Logistic Regression and Linear Discriminant Models
After fitting the two models, they were compared on the basis of the variables
selected, the sign and magnitude of the coefficients, sensitivity, specificity, overall
classification accuracy and the areas enclosed under their respective ROC curves. Thus the
cut-off or probability points were varied from 0.1 to 0.9 and the resulting attributes at each
point were recorded and illustrated in Table 6. The cut-off points are such that if the
probability of being stunted is less than or equal to that value the individual is classified as
not stunted, otherwise they would fall in the stunted category.
With reference to Tables 4 and 5, the variables selected by the two models have two
variables only in common, namely gender and birth weight. Logistic regression in addition to
the two, also identified household income, birth interval and mother’s education as important
predictors of stunting whilst linear discriminant analysis picked gestational age as the other
important variable. Both methods yielded negative coefficients for all the factors identified as
important thus implying protective effects. The magnitude of the coefficients of the
35
common variables, gender and birth weight were different, thus indicating different degrees
of discriminatory or predictive power in each of the models.
Table 6: Comparison of Logistic Regression and Linear Discriminant Analysis in terms of Sensitivity, Specificity and Classification Accuracy Cut-off Point*
Logistic Regression Linear Discriminant Analysis
Sensitivity Specificity Accuracy Sensitivity Specificity Accuracy
0.1 94.7 21.2 37.5 100 0.52 22.05
0.2 70.59 59.37 61.86 98.72 6.99 26.84
0.3 44.86 82.16 73.89 93.69 22.41 37.83
0.4 23.68 92.98 77.61 81.83 44.12 52.28
0.5 13.19 97.44 78.76 62.83 65.33 64.79
0.6 5.51 99.14 78.37 41.58 83.37 74.33
0.7 1.95 99.88 78.16 22.38 93.37 78.01
0.8 0.11 100 77.85 8.68 98.24 78.86
0.9 0.00 100 77.82 1.44 99.84 78.55
*P(stunting): scores greater than the cut-off point are classified as stunted, whilst those less than or equal to the cut-off point are classified as not stunted.
Comparing the two models, it was noted that at all cut-off points the linear
discriminant model has a higher sensitivity (proportion of true positives among all the
positive results) than the logistic regression model whilst the logistic regression model has
higher specificity (proportion of true negatives among all the negative results) at all levels. In
terms of the overall classification accuracy rate, the logistic regression model performed
better than the linear discriminant model at most of the cut-off points except at the 0.8 and
0.9 levels where the linear discriminant model performed slightly better than its counterpart.
When both methods perform at their maximum classification accuracy (logistic-78.76%,
36
discriminant-78.86%), they have an extremely high specificity (logistic-97.44%,
discriminant-98.24%) and low sensitivity (logistic-13.19%, discriminant-8.68%), implying
that both models would be very good at identifying subjects without the condition. Though
attained at different cut-off points, the two methods have almost the same maximum
classification accuracy rate (78%) which is classified as acceptable discrimination according
to Hosmer and Lemeshow (2000).
The area enclosed under the ROC curve represents the classification ability of a
model, hence when comparing the ROC curves for the logistic regression model and the
linear discriminant model in Figures 1 and 2 respectively, it was noted that logistic regression
had a superior classification ability as it enclosed an area of 0.7150 compared to 0.7004 for
the linear discriminant ROC curve. However, their level of performance is not significantly
different (0.0146).
0.00
0.25
0.50
0.75
1.00
Sen
sitiv
ity
0.00 0.25 0.50 0.75 1.001 - Specificity
Area under ROC curve = 0.7150
Figure 1: Receiver Operating Characteristics (ROC) curve for Logistic Regression model.
37
0.2
5.5
.75
1S
ensi
tivity
0 .25 .5 .75 11 - Specificity
Area under curve = 0.7004 se(area) = 304.0297
Figure 2: Receiver Operating Characteristics (ROC) curve for Linear Discriminant Analysis model.
4.6 Linear Discriminant Analysis as an Exploratory Step for Logistic
Regression
Linear discriminant function analysis works by classifying a subject by calculating its
discriminant score as a linear combination of the strongly discriminating covariates. Thus
using these identified variables in a logistic regression model could possible improve
classification ability, hence the three factors in the linear discriminant model were fed into a
logistic regression. The resulting model was compared to the initial logistic regression model.
38
0.00
0.25
0.50
0.75
1.00
Sen
sitiv
ity
0.00 0.25 0.50 0.75 1.001 - Specificity
Area under ROC curve = 0.7024
Figure 3: Receiver Operating Characteristics (ROC) curve for the second Logistic Regression model.
Table 7: Comparison of the Two Logistic Regression Models
Attribute Logistic Regression Model 1 Logistic Regression Model 2
Log Likelihood -1986.4762 -4286.5091
P-value(χ2) 0.0000 0.0000
Pseudo R2 0.0999 0.0884
Sensitivity(at 0.5 cut-off) 13.19% 9.91%
Specificity(at 0.5 cut-off) 97.44% 97.96%
Accuracy(at 0.5 cut-off) 78.76% 78.91%
Area under ROC curve 0.7150 0.7024
39
Using the likelihood ratio test to compare the two models, the first model still remains
significant (p-value<0.001), the pseudo R2 for the first model is 10% compared to 9% for the
second model, the specificity for the two models is almost equal, model 1 has a slightly
higher sensitivity than model 2, the accuracy for the two models is the same (79%) and the
area under the ROC curves differs slightly (0.7150 and 0.7024) with the first model having
minimal superiority. These statistics thus indicate that using linear discriminant analysis as an
exploratory step to logistic regression does not enhance the prediction ability of the logistic
regression model.
40
CHAPTER FIVE: DISCUSSION
5.1 Discussion of Results
In general, the classification ability of the logistic regression model was as good as
that of the linear discriminant analysis model. However, logistic regression identified five
variables as important predictors of stunting whilst linear discriminant analysis identified
three factors with two of them in common. The signs of the coefficients was the same for the
two models but the magnitudes of the effects differed significantly. This study could
possibly be the first to try using logistic regression and linear discriminant analysis in
combination though the result did not improve the overall classification accuracy rate.
This study established that the prevalence of stunting in the selected subsample of the
Zvitambo cohort was 21.58%. The two methods collectively identified the following as
important predictors of stunting: gender, birth weight, birth interval, household income,
mother’s education level and gestational age. Upon comparison of their predictive accuracy,
logistic regression proved to be slightly more superior than linear discriminant analysis
though the overall classification rate for both methods could be classified as acceptable (lying
between 70% and 80%).4 When used in combination, thus applying logistic regression to the
predictors identified by linear discriminant analysis as highly discriminatory it was noted that
the predictive accuracy does not improve.
Several studies have looked at the risk factors of stunting in children in several
countries and they have identified a whole range of predictors. The finding that gender is an
important risk factor agrees with the meta-analysis carried out by Wamani et al (2007) of data
from 16 Demographic and Health Surveys from Sub-Saharan African countries of which
Zimbabwe was a part. They revealed that male children may actually be more vulnerable to
health inequalities than females of the same age group. In their study in Indonesia, Ramli et
41
al (2009) also identified male sex as a risk factor for stunting. Low birth weight emerged as
one of the principal risk factors for stunting just as it was also noted by Ricci et al (1996) in
the Phillipines and by Vitolo et al (2008) in Brazil. The same study by Vitolo et al (2008)
identified family income as a risk factor and that agrees with findings by Senbanjo et al
(2011) in Nigeria who also established that the mother’s level of education was also
associated with stunting in their children. This possibly being as a result of the feeding
practices and hygienic tendencies which are the mother’s responsibility and affect the child’s
development. The risk factors of stunting identified by this research concur with findings of
other studies on the same condition. Since stunting is a health indicator of chronic
malnutrition, identification of its risk factors would assist the crafting of public health
interventions aimed at reducing its impact.
The predictive accuracy of logistic regression and linear discriminant analysis were
compared and the two methods had a negligible difference in the maximum overall
classification rate, logistic regression-78.76% versus 78.86% for linear discriminant analysis.
This finding contrasts other studies that made the same comparison whereby logistic
regression emerged with slight superiority over linear discriminant analysis. Panagiotakos
(2006) compared the two methods in predicting death or survival of in-patients admitted with
Acute Coronary Syndrome (ACS) and found out that logistic regression had a maximum
overall classification rate of 96.8% compared to 81.4% for linear discriminant analysis.
Montgomery et al (1987), Antonogeorgos G, et al (2009) and Press and Wilson (1978) noted
that logistic regression had a maximum classification rate of 82.2%, 79.2% and 80%
compared to 77.5%, 77.4%, and 68% for linear discriminant analysis, respectively. However,
when comparing areas under the ROC curves for the two statistical methods, this research
revealed that logistic regression was just slightly better in its classification ability thus 71.5%
compared to 70.24%, whilst Panagiotakos (2006) and Antonogeorgos G, et al (2009) found
42
no differences in the areas, 81.8% and 74.6% for logistic model versus 81.1% and 74.4% for
discriminant model respectively. However, in the comparison of the two statistical methods
it should be noted that the assumptions of multivariate normality and equal covariance
matrices within the categories for the application of linear discriminant analysis were violated
just as observed in most studies that used real research data.28,32
It was noted in this study that at the cut-off points where logistic regression and
linear discriminant analysis perform best, they both had very low sensitivity and high
specificity in agreement with the findings of Antonogeorgos et al (2009), Montgomery et al
(1987) and Panagiotakos (2006). This revealed that both models would perform excellently in
identifying children who are not stunted, implying that whenever any of the two models
classifies someone as having the condition it would be highly likely that indeed that person
has the condition.35
Linear discriminant analysis classifies subjects on the basis of their characteristics of
some independent variables which help discriminate among the groups. The variables
identified as highly discriminatory by linear discriminant analysis were used in logistic
regression to check if classification ability would improve. The resulting model was not any
better in performance (70.04% compared to 71.5%), hence there was no value addition in
applying the two statistical methods in combination. However, none of the studies reviewed
used the two methods in combination.
The findings of this study identified some risk factors for stunting which are
important for public health policy formulation, specifically with reference to health
promotion interventions aimed at reducing national stunting prevalence. The comparison
between logistic regression and linear discriminant analysis which resulted in negligible
differences in classification abilities would help researchers when faced with making a choice
between the two methods. In this study the two methods converged, possibly because of the
43
large sample size that was used, as literature states that the two tend to produce similar results
as sample size increases beyond 50.28 However, it should still need to be noted that the most
appropriate method to be used for analysis should always be the one whose assumptions are
not seriously violated.
5.2 Limitations of the study
The findings of this study were possibly compromised by the violation of assumptions
of multivariate normality and equal covariances within the groups which normally produces
unstable estimates for linear discriminant analysis. This violation may have been caused by
the inclusion of categorical variables, some of which had just two categories(e.g gender)
which does not conform to the requirement that categorical variables may be used in linear
discriminant analysis if they have many categories (5-6). The research was unable to
compare computation time taken by the two methods as that comparison could also help
researchers when choosing which method to use for analysis and this was done in similar
studies. Another point of comparison would have been the order of selection of variables into
the model but this was not possible as linear discriminant analysis used a simultaneous
approach and not a stepwise approach that was used for logistic regression analysis. External
validity of the findings of this study was limited by the fact that the data that was used was
collected from urban infants and so cannot be generalized to all infants of the same age.
Using data collected by a national survey such as the Demographic and Health Survey would
help overcome this shortcoming. Despite these limitations, this study remains important in
that it validated findings of studies carried out in other countries and identified some
important risk factors for stunting and estimated its prevalence in Harare, Zimbabwe.
44
CHAPTER SIX: CONCLUSION
The two statistical methods, logistic regression and linear discriminant analysis
proved to have the same classification ability. Convergence was evidenced by their overall
classification rates that were equal. At their maximum classification ability level both
methods were observed to have very high specificity and low sensitivity, implying that when
a subject was classified as positive by any of the methods it would be highly likely that they
indeed were positive. It was established that there was no value added in classification ability
when the two methods were used in combination, hence the choice to use any one of them
would have to be made by considering the assumptions for the application of each of them.
The similar signs for coefficients of the two methods indicate convergence with respect to the
direction of the effect of the factors identified. Logistic regression identified sex, birth
weight, birth interval, household income and mother’s education level as important predictors
of stunting, whilst linear discriminant analysis identified sex, birth weight and gestational
age.
This study would help researchers especially statistical analysts that when making a
choice of the statistical method to use it is of fundamental importance to consider the
assumptions for the application of each as violation of assumptions may distort results. The
results of this research have also shown that linear discriminant analysis and logistic
regression tend to converge in their classification capability when the sample size is
large(>50).
45
REFERENCES
1. Agresti A. (2007).An Introduction to Categorical Data Analysis,2nd Edition, John Wiley & Sons, Canada.
2. Evans J. (1998). Epidemiology in Practice: Randomized Controlled Trials; community Eye Health; 11(26): pp 26-27.
3. Freedman K.B, Back S. and Bernstein J. (2001). Sample Size and Statistical Power of Randomized Controlled Trials in Orthopaedics; The Journal of Bone and Joint Surgery (Br) 2001; 83-B(3): 397-402.
4. Hosmer D.W and Lemeshow S. (2000).Applied Logistic Regression,2nd Edition, John Wiley and Sons, Canada.
5. http://people.stern.nyu.edu/jsimonof/classes/2301/pdf/discrim.pdf, 13/06/2014 6. Humphrey J.H et al (2006). Effects of a Single Large Dose of Vitamin A Given
during the Postpartum Period to HIV-Positive Women and their Infants on Child HIV Infection, HIV-free Survival and Mortality, Journal of Infectious Diseases 2006, 193: 860-871.
7. Jiang Y. et al (2014). Prevalence and Risk Factors for Stunting among Children under 3 years in Mid-western Rural Areas of China. Child Care, Health and Development. doi.10.1111/cch.12148.
8. Krzanowski W.J (1986). Multiple Discriminant Analysis in the Presence of Mixed Continuous and Categorical Data, Computers and Maths with Applications, 1986, 12A(2): pp 179-185.
9. Kutner M.H et al. (2005). Applied Linear Statistical Models, 5th Edition, MacGraw-Hill/Irwin, New York.
10. Overall J.E and Woodward J.A, (1977). Discriminant Analysis with Categorical Data, Applied Psychological Measurement 1977, 1(3): pp 371-384.
11. Paudel R. et al (2012). Risk Factors for Stunting among Children: a community-based case control study in Nepal, Kathmandu University Medical Journal 2012, 10(39):18-24.
12. Pohar M, Blas M. and Turk S. (2004). A Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study, Metodološki zvezki 2004, 1(1): pp 143-161.
13. Prendergast A.J et al. (2014). Stunting is Characterized by Chronic Inflammation in Zimbabwean Infants; PLoS ONE 9(2): e86928.doi: 10.1371/journal.pone.0086928.
14. Press S.J & Wilson S. (1978).Choosing Between Logistic Regression and Discriminant Analysis. Journal of the American Statistical Association,73,pp699-705.
15. Ramli et al. (2009). Prevalence and Risk Factors for Stunting among Under Fives in North Maluku Province of Indonesia, BMC Pediatrics 2009;9:64.
16. Ricci J.A and Becker S. (1996). Risk Factors for Wasting and Stunting among Children in Metro Cebu, Phillipines, American Journal of Clinical Nutrition, 63(6):pp 966-975.
46
17. Senbanjo I.O et al. (2011). Prevalence and Risk Factors for Stunting among School Children and Adolescents in Abeokuta, Southwest Nigeria, Journal of Health and Population Nutrition, 29(4): pp 364-370.
18. Stolberg H.O et al. (2004). Fundamentals of Clinical Research for Radiologists: Randomized Controlled Trials; American Journal of Roentgenology 2004; 183: 1539-1544.
19. Vitolo M.R et al (2008). Some Risk Factors Associated with Overweight, Stunting and Wasting among Children under 5 years old, Journal de Pediatria(Rio J), 2008, 84(3):pp 251-257.
20. Wamani H. et al (2007). Boys are more Stunted than Girls in sub-Saharan Africa: a meta-analysis of 16 Demographic and Health surveys. BMC Pediatrics 2007, 7(17).
21. www.sagepub.com/upm-data/5081_Spicer_Chapter_5.pdf,16/06/2014 22. www.who.int/ceh/indicators/0=4stunting.pdf, 16/06/2014 23. Zvitambo Study Protocol Document. 24. Shrimpton R, et al (2001). Worldwide Timing of Growth Faltering: Implications for
Nutritional Interventions. Pediatrics 2001; 107: 1-7. 25. United Nations Administrative Committee on Coordination/Sub-Committee on
Nutrition. Fourth Report on the World Nutrition Situation: Nutrition Throughout the Life Cycle. Geneva, Switzerland: United Nations Administrative Committee on Coordination/Sub-Committee on Nutrition; 2000.
26. Humphrey J.H, (2009). Child Undernutrition, Tropical Enteropathy, Toilets and handwashing. Lancet 2009; 374: 1032-35.
27. Wamani H, et al (2004). Mothers’ education but not fathers’ education, household assets or land ownership is the best predictor of child health inequalities in rural Uganda. International Journal Equity in Health 2004; 3:9.
28. Antonogeorgos G, et al (2009). Logistic Regression and Linear Discriminant Analyses in Evaluating Factors Associated with Asthma Prevalence among 10-12-Years-Old Children: Divergence and Similarity of the Two Statistical Methods. International Journal of Pediatrics 2009: 952042.
29. Harrell F.E and Lee K.L. (1985). A comparison of the discrimination of discriminant analysis and logistic regression under multivariate normality. In P.K.Sen(Ed.): Biostatistics: Statistics in Biomedical, Public Health and Environmental Sciences. North Holland: Elsevier Science Publishers, 333-343.
30. Nduati R.W, John G.C, Richardson B.A, et al. Human immunodeficiency virus type 1-infected cells in breast milk: association with immunosupression and vitamin A deficiency. Journal of Infectious Diseases 1995; 172:1461-8.
31. Zimbabwe National Statistics Agency(ZIMSTAT) and ICF International.2012. Zimbabwe Demographic and Health Survey 2010-11. Calverton, Maryland: ZIMSTAT and ICF International1.Inc.
32. Montgomery M.E, White M.E and Martin S.W. A Comparison of Discriminant Analysis and Logistic Regression for the Prediction of Coliform Mastisis in Dairy Cows. Canadian Journal of Veterinary Research. 1987; 51(4)pp495.
47
33. Panagiotakos D.B. A Comparison between Logistic Regression and Linear Discriminant Analysis for the Prediction of Categorical Health Outcomes. International Journal of Statistical Sciences. 2006; 5:pp 73-84.
34. Morrison D.G. On the Interpretation of Discriminant Analysis. Journal of Marketing Research. 1969; 6: 156-63.
35. Akobeng A. Understanding Diagnostic Tests 1: Sensitivity, Specificity and Predictive Values. Acta Paedrica 2006; 96: pp 338-341.
48
Appendix A: Joint Research Ethics Committee Approval
49
Appendix B: Letter of Authorisation to use Data
50
Appendix C: Logistic Regression and Linear Discriminant Analysis
Logistic Regression
Just like multiple linear regression, logistic regression attempts to come up with a
composite function of multiple independent variables that predicts the probability [π(x)] of a
case being in a given category21, in this case the probability of being stunted. The composite
function takes the form:
π(x) = α + βx (1)
This implies that being a probability, the result of the composite function should
always lie between 0 and 1, which is not the case when coefficients are calculated using
ordinary least squares like in linear regression. This method forces linearity to a relationship
that is more likely S-shaped as it gives values outside the 0-1 range. The solution to this
violation lies in introducing another probability index called the odds, which looks at the
probability of being in a category over the probability of not being in that category. Thus the
probability itself is actually the odds divided by one plus the odds yielding the following
mathematical function which is exponential1:
π(x) = eα + βx / 1 + eα + βx (2)
Thus taking the natural logarithm of the odds instead of the probabilities would revert back to
the composite function of the multiple independent variables with all its usual properties
resulting in the logistic regression model:
log [π(x)/ 1 - π(x)] = α + βx (3)
51
The logistic regression model is a special case of the generalized linear model, with a
binomial random component(stunted or not stunted), a logit link function (log π/1- π) and a
linear systematic component(α + βx) which can assume any value on the real number line.1
The estimation of parameters would produce log odds that in turn result in predictive
probabilities that accurately classify cases into the correct category. The predictive
probabilities and the actual categories give a log likelihood function such that the desirable
coefficients are those that maximize this function. Thus when different logistic regression
models are compared the best model would be that with the highest log likelihood value as it
would have the best predictive power.4 Logistic regression quite easily handles categorical
independent variables by the use of design or dummy variables. With a cut-off probability of
0.5, a case can be classified into one group if the resulting probability is less than 0.5 or into
the other group if the probability is greater than 0.5.28
Logistic regression does not have rigid assumptions to be fulfilled hence its vast applicability
in multivariable analysis. These assumptions include:
• The true conditional probabilities are a logistic function of the independent variables.
• No important variables are omitted.
• No extraneous variables are included.
• The independent variables are measured without error.
• The observations are independent.
• The independent variables are not linear combinations of each other.
Thus, multicolinearity of the predictor variables may need to be checked by treating one of
these variables as a pseudo-dependent variable and regressing all the others against it.
Independency of the individual cases may also need to be verified by using residual plots.9
52
Linear Discriminant Analysis
Discriminant analysis performs the same task as multiple linear regression, that of
using various independent variables to determine an outcome, in this case which has several
categories (two or more outcome categories)9. It is used to predict group membership or
category membership based on a linear combination of the predictor variables4. The process
can also determine which of the predictor variables discriminate between two or more
naturally occurring mutually exclusive and exhaustive groups, where there is no natural
ordering in the groups14. This method requires that the independent variables be normally
distributed and the covariance matrices should be equal.
The discriminant analysis procedure starts with a set of observations where both
group membership and values for the predictor variables are known. The product of the
process is a model that can be used to predict group membership when only predictor
variables are known14. Another purpose of discriminant function analysis is an understanding
of the data set, since examination of the resulting prediction model can highlight the
relationship between group membership and the variables used for prediction. Predictive
discriminant analysis provides a way of assigning new cases to groups. It uses the new case’s
scores on the predictor variables to predict the category to which the case belongs. Statistical
significance tests using chi-square allow one to see how well a function separates groups.
This analytical technique works by formulating a new variable, the discriminant
function score which is used to predict to which group a case belongs. The result is a
function or equation similar to a multiple linear regression equation:
D = a + v1Age + v2Education + v3Birthweight + ………..+ vn Feeding (4)
Where D = discriminate score(stuntedness);
53
vi = the ith discriminant coefficient;
Xi = respondent’s score for the ith variable (e.g Age);
a = constant;
n = the number of predictor variables.
After determining the discriminant function, the discriminant score for each
observation is calculated and each case is classified into the appropriate group depending on
the cut-off point used for classification. This technique pivots on a statistic called the
eigenvalue, which is the ratio of the between-group/within-group sum of squares such that the
best discriminant coefficients are those that maximize the eigenvalue for the composite
variable5.
54
Appendix D: Zvitambo Questionnaire Used for Data Collection
55
56
57
58
59
60
61
62
63
64
Recommended