ANALYSIS OF CONSUMPTION BEHAVIORS AND MARKET STRUCTURE WITH
EXCESS ZEROS AND OVER-DISPERSION
By
YUAN JIANG
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2018
© 2018 Yuan Jiang
To my Mom and Dad
ACKNOWLEDGMENTS
I would like to take this opportunity to express my gratitude to those who have helped me
complete my dissertation. I thank my supervisor, Dr. Lisa House, whose profound
knowledge of the field provided me with valuable advice and sparked my thinking. I am
grateful for her tremendous help, support, patience, and encouragement throughout this
wonderful journey of challenge and fulfillment. None of the achievements during my study would
have been possible without her guidance and advice.
I would also like to express my great appreciation to all the professors on my committee. Dr.
Zhifeng Gao has offered me considerable help and guidance on the simulation study design
and data analysis, along with cheerful encouragement. I also thank Dr. Brandon
McFadden, Dr. Hyeyoung Kim, and Dr. Zhihua Su for their suggestions and help toward the
completion of this dissertation.
I would like to thank the Food and Resource Economics Department for giving me the
chance to study and obtain valuable professional training. Without its support, none of this would
have been possible. In addition, I would like to express my heartfelt appreciation to my dear friends and fellow
graduate students for their support and encouragement.
From the bottom of my heart, I want to express my gratitude to all of my family for their
love and support, especially my parents.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS ...............................................................................................................4
LIST OF TABLES ...........................................................................................................................7
LIST OF FIGURES .......................................................................................................................10
ABSTRACT ...................................................................................................................................11
CHAPTER
1 INTRODUCTION ..................................................................................................................13
2 COMPARISON OF THE PERFORMANCE OF COUNT DATA MODELS UNDER
DIFFERENT ZERO-INFLATION SCENARIOS USING SIMULATION STUDIES .........20
Background .............................................................................................................................20
Literature Review ........................................23
Count Data and Generalized Linear Model ........................................23
Poisson Regression Model and Applications ........................................25
Problems with the Poisson Regression ........................................25
Negative Binomial Regression Models and Applications ........................................27
Zero-inflated Models and Applications ........................................29
Hurdle Model and Applications ........................................32
Comparison of the Models ........................................35
Gaps and Shortcomings ........................................37
Method ........................................38
Research Questions ........................................38
Monte Carlo Simulation ........................................39
Simulation Study Design ........................................40
Model Evaluation of the Simulation Studies ........................................43
Data Generating Mechanism ........................................44
Results.....................................................................................................................................44
Pseudo-Population Zero-inflated Poisson Model ............................................................44
Model fit ...................................................................................................................45
Relative bias of E(Y|X) ........................................45
Abilities of capturing zero observation ........................................46
Pseudo-Population Hurdle Poisson Model ........................................47
Model fit ........................................48
Relative bias of E(Y|X) ........................................48
Abilities of capturing zero observation ........................................49
Pseudo-Population Zero-inflated Negative Binomial Model ........................................49
Model fit ........................................50
Relative bias of E(Y|X) ........................................51
Abilities of capturing zero observation ........................................52
Pseudo-Population Hurdle Negative Binomial Model ........................................53
Model fit ........................................54
Relative bias of E(Y|X) ........................................54
Abilities of capturing zero observation ........................................56
Compare the Mis-specified Models across All the True Models ........................................56
Conclusion ........................................58
Model Performance Given Four Pseudo-Populations ........................................60
True model - Zero-inflated Poisson Model ........................................60
True model - Hurdle Poisson Model ........................................61
True model - Zero-inflated Negative Binomial Model ........................................61
True model - Hurdle Negative Binomial Model ........................................62
Comparison between Different Models ...........................................................................63
Poisson formulation versus Negative-Binomial formulation ........................................63
Zero-inflated models versus Hurdle models ........................................66
Capability of Predicting Zero Observations ....................................................................67
Future Work and Limitations ..........................................................................................68
3 A TRIPLE HURDLE COUNT DATA MODEL OF MARKET PARTICIPATION
AND CONSUMPTION ..........................................................................................................87
Background ........................................87
The Econometric Modeling of Count Data for Consumption Behavior ........................................89
Motivation ........................................92
Conceptual Framework ........................................94
Econometric Framework ........................................95
Triple Hurdle Count Data Model with Independent Stages ........................................95
Triple Hurdle Count Data Model with Interdependence ........................................98
Marginal Effects and Interpreting Results ........................................102
Comparing Triple Hurdle Count Data Model and the Double Hurdle Models ........................................105
Variables and Data ........................................106
Data Set .........................................................................................................................106
Variables ........................................107
Results ........................................109
Regression results of Triple Hurdle Count Data Model Results ........................................110
Marginal Effects of the Triple Hurdle Count Data Model ........................................114
Conclusion ............................................................................................................................116
4 DISCUSSION .......................................................................................................................128
LIST OF REFERENCES .............................................................................................................131
BIOGRAPHICAL SKETCH .......................................................................................................136
LIST OF TABLES
Table page
2-1 Convergence rate, true model is ZIP.................................................................................70
2-2 Mean Log-likelihood, true model is ZIP ............................................................................70
2-3 Mean AIC, true model is ZIP .............................................................................................70
2-4 Relative Bias for E(Y|X), true model is ZIP ......................................................................71
2-5 Observed and predicted zero observations, true model is ZIP ...........................................71
2-6 Convergence rate, true model is HP ..................................................................................71
2-7 Mean Log-likelihood, true model is HP ............................................................................72
2-8 Mean AIC, true model is HP..............................................................................................72
2-9 Relative Bias for E(Y|X), true model is HP .......................................................................72
2-10 Observed and predicted zero observations, true model is HP ...........................................73
2-11 Convergence rate, true model is ZINB ..............................................................................73
2-12 Mean Log-likelihood, true model is ZINB (dispersion=0.5) .............................................73
2-13 Mean AIC, true model is ZINB (dispersion=0.5) ..............................................................74
2-14 Mean Log-likelihood, true model is ZINB (dispersion=1) .................................................74
2-15 Mean AIC, true model is ZINB (dispersion=1) .................................................................74
2-16 Mean Log-likelihood, true model is ZINB (dispersion=2) .................................................75
2-17 Mean AIC, true model is ZINB (dispersion=2) .................................................................75
2-18 Mean Log-likelihood, true model is ZINB (dispersion=4) .................................................76
2-19 Mean AIC, true model is ZINB (dispersion=4) .................................................................76
2-20 Mean AIC of ZINB model, true model is ZINB ................................................................76
2-21 Relative Bias for E(Y|X), true model is ZINB (dispersion=0.5) .......................................77
2-22 Relative Bias for E(Y|X), true model is ZINB (dispersion=1) ..........................................77
2-23 Relative Bias for E(Y|X), true model is ZINB (dispersion=2) ..........................................77
2-24 Relative Bias for E(Y|X), true model is ZINB (dispersion=4) ..........................................78
2-25 Relative Bias for E(Y|X) of ZINB model, true model is ZINB .........................................78
2-26 Observed and predicted zero observations, true model is ZINB (dispersion=0.5) ...........78
2-27 Observed and predicted zero observations, true model is ZINB (dispersion=1) ..............79
2-28 Observed and predicted zero observations, true model is ZINB (dispersion=2) ..............79
2-29 Observed and predicted zero observations, true model is ZINB (dispersion=4) ..............79
2-30 Convergence rate, true model is HNB ...............................................................................80
2-31 Mean Log-likelihood, true model is HNB (dispersion=0.5) ...............................................80
2-32 Mean AIC, true model is HNB (dispersion=0.5) ...............................................................80
2-33 Mean Log-likelihood, true model is HNB (dispersion=1) ..................................................81
2-34 Mean AIC, true model is HNB (dispersion=1) ..................................................................81
2-35 Mean Log-likelihood, true model is HNB (dispersion=2) ..................................................81
2-36 Mean AIC, true model is HNB (dispersion=2) ..................................................................82
2-37 Mean Log-likelihood, true model is HNB (dispersion=4) ..................................................82
2-38 Mean AIC, true model is HNB (dispersion=4) ..................................................................82
2-39 Mean AIC of HNB, true model is HNB ............................................................................83
2-40 Relative Bias for E(Y|X), true model is HNB (dispersion=0.5) ........................................83
2-41 Relative Bias for E(Y|X), true model is HNB (dispersion=1) ...........................................83
2-42 Relative Bias for E(Y|X), true model is HNB (dispersion=2) ...........................................84
2-43 Relative Bias for E(Y|X), true model is HNB (dispersion=4) ...........................................84
2-44 Relative Bias for E(Y|X) of HNB model, true model is HNB ...........................................84
2-45 Observed and predicted zero observations, true model is HNB (dispersion=0.5) ..............85
2-46 Observed and predicted zero observations, true model is HNB (dispersion=1) .................85
2-47 Observed and predicted zero observations, true model is HNB (dispersion=2) .................85
2-48 Observed and predicted zero observations, true model is HNB (dispersion=4) .................86
2-49 Average AIC statistics across all the models .....................................................................86
2-50 Average Relative Bias across all the models .....................................................................86
3-1 Variable Descriptions.......................................................................................................120
3-2 Estimated probabilities for fresh blueberry consumption ................................................122
3-3 Fresh blueberry consumption: summary statistics from double hurdle approach and
triple hurdle model ...........................................................................................................122
3-4 Fresh blueberry consumption: regression results .............................................................123
3-5 Marginal Effects for Triple Hurdle Count Data Model ...................................................125
3-6 Comparison of the marginal effects for Triple Hurdle Count Data Model and Double
Hurdle Count Model ........................................................................................................126
LIST OF FIGURES
Figure page
3-1 Zero consumption of fresh blueberry per month .............................................................119
3-2 Diagram of the data generating process of the Triple Hurdle Count Data model ...........119
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
ANALYSIS OF CONSUMPTION BEHAVIORS AND MARKET STRUCTURE WITH
EXCESS ZEROS AND OVER-DISPERSION
By
Yuan Jiang
May 2018
Chair: Lisa House
Major: Food and Resource Economics
There is a long history of interest in modeling consumer behavior and predicting market
structure in agricultural economics. When analyzing consumption behavior at the individual
level, the data is frequently formatted as count data, especially when measuring consumption
frequency or intensity during a given period. However, this type of data is often characterized by
an excess of zero observations (zero-inflation) and a heavy right tail (over-dispersion). These factors
influence whether or not the Poisson regression model, typically used with count data, can be
appropriately applied.
To address the shortcomings of the Poisson regression model, a number of modified
Poisson regression models have been developed. The most popular approaches are the Zero-
inflated/modified model and the Hurdle model, which differ based on different assumptions
about the sources of zero observations. The first part of this dissertation aims to review and
evaluate the performance of these modified models given different levels of zero-inflation and
over-dispersion using a simulation study regarding model fit and prediction capability.
Furthermore, special attention will be given to the comparison of the ability of the models to
predict the correct latent classes, as well as understanding the consequences of model
misspecification, that is, of fitting a model whose assumed data-generating mechanism differs
from the true one. Based on this
analysis, the Zero-inflated models are preferred to the Hurdle models, especially regarding model
fit and prediction capability. If the assumption about the source of the zeros is stressed and interest is focused on the
accuracy of zero predictions, then the Hurdle models should still be considered. Moreover,
this analysis verified that models with Negative Binomial formulations are preferred to
models with Poisson formulations when the data are over-dispersed.
The second part of this dissertation proposes a new approach, a Triple Hurdle Count Data
model, to analyze consumption behavior. This new model extends the Zero-inflated models and
Hurdle Count Models to market participation modeling. It allows us to identify consumer
participation, desire, and acquisition separately, and to explore the structurally
different reasons behind consumers’ decisions on market participation, consumption
intention, and consumption intensity in sequence. The new model is applied to a consumer
choice problem of blueberry consumption to discuss the difference in insights gained by
employing the Triple Hurdle Count Data model compared to the Double Hurdle approach.
CHAPTER 1
INTRODUCTION
There is a long history of interest in modeling consumer behavior and predicting market
segmentation in economics, in particular, understanding consumers’ preferences and
consumption. When analyzing consumer behavior on the individual level, especially when using
a survey to collect the primary data, consumption is frequently recorded in the form of count
data. Count data usually arise when measuring consumption frequency or intensity during a
certain period. This dissertation focuses on how to model and understand
consumption behavior using count data, and how to better predict market segmentation using
appropriate econometric methods.
To analyze consumer behavior and to predict market segmentation, it is
important to explore the factors that influence consumption behavior, which includes decisions
on both market participation and consumption. However, data on how often
consumers purchase in a given period present an interesting statistical challenge because
many observations are recorded as zero. For example, in a survey eliciting information on
blueberry consumption, answers to the question “How often did you consume blueberries last
month?” will include many respondents answering none (zero), and the number of zero
observations will vary depending on the time of the year. Typically, this type of data will be
characterized by an excess of zero observations (or zero-inflation) that influences modeling and
interpretation of the data.
Regarding consumption decisions, these abundant zero-observations can be the result of
three different reasons. First, it is possible that some individuals have a non-positive desire for
the product. In other words, these individuals will not be consumers of the product because of
some permanent reason (such as an allergy). Second, while some individuals do desire to
purchase a product, they do not consume for some temporary reason (for example, the current
price of the product is greater than the upper bound of their willingness, or ability, to pay at the
given income level). In this case, the zero consumption is the corner solution for the individual’s
utility-maximizing decision. Third, some individuals have a positive desire to participate in the market,
but zero consumption may be observed because of infrequent purchases (for example, they
purchase, but not during the period surveyed). This situation often arises in the case of
durable goods.
Regarding market segmentation, individuals who choose zero consumption because of
the first reason are considered non-consumers, and these corresponding zero observations are
called structural zeros. The individuals who have positive market participation desire but were
observed consuming zero units because of the second and third reasons are considered as
potential consumers, and the corresponding zero observations are called sampling zeros. Thus,
the particular interpretations given to these zero consumption observations will have a crucial
bearing on the estimation techniques, and the interpretation of results, especially for market
segmentation.
Considering the appropriate statistical modeling, when the dependent variable is count
data, one of the most commonly used statistical methods is the Poisson regression. Poisson
regression is commonly used by economists to model the number of events, like the frequency of
consumption. However, the Poisson model fails to provide an adequate fit when the data contain
“excessive zeros.” In that case, the data mean is pulled towards
zero, violating the assumption of mean-variance equality that is part of the
Poisson regression model.
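
As a rough numerical illustration of this violation (a sketch only, not part of the analysis reported in this dissertation), the Python snippet below computes the mean and variance of a count that equals zero with some extra probability and otherwise follows a Poisson distribution. The function names, the extra-zero probability pi_zero, the Poisson rate lam, and the truncation point k_max are all assumptions introduced here for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability P(Y = k) for rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def zero_mixture_moments(pi_zero: float, lam: float, k_max: int = 60):
    """Mean and variance of a count that is 0 with probability pi_zero
    and Poisson(lam) otherwise, obtained by summing over the pmf.

    k_max truncates the sum; it is assumed large enough that the
    remaining tail mass is negligible for the chosen lam.
    """
    pmf = [(1 - pi_zero) * poisson_pmf(k, lam) for k in range(k_max + 1)]
    pmf[0] += pi_zero  # extra probability mass at zero
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var


# Hypothetical values: 40% extra zeros on top of a Poisson(3) count.
mean, var = zero_mixture_moments(pi_zero=0.4, lam=3.0)
print(f"mean = {mean:.4f}, variance = {var:.4f}")
```

With these hypothetical values the mean is (1 - 0.4)(3) = 1.8, whereas the variance is 1.8(1 + 0.4 × 3) = 3.96, so the variance is more than twice the mean. Setting pi_zero to 0 recovers the ordinary Poisson case, in which the two moments coincide.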
To address this potential shortcoming of the Poisson regression model, a number of
modified Poisson regression models have been developed. The Zero-inflated Poisson (ZIP)
model was first proposed by Lambert (1992) and is commonly used. Following this, the
Zero-inflated Negative Binomial (ZINB) model was developed to further handle over-dispersion in the data,
that is, the inequality of the mean and variance (Consul and Jain, 1973;
Famoye and Singh, 2006).
The Zero-inflated count data models assume that the zero observations come from two
distinct sources: “sampling zeros” and “structural zeros.” When applied to consumption
analysis, zero-inflated count data models allow for zero-consumption to come from both cases
where the consumer is a genuine non-participant (structural zero), and when zero consumption is
the corner solution of a standard consumer demand problem (sampling zero).
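
The two sources of zeros can be made concrete with a small sketch. Assuming the usual zero-inflated Poisson formulation, in which an observation is a structural zero with probability pi_structural and is otherwise drawn from a Poisson distribution with rate lam, the Python function below splits the total probability of observing a zero into its structural and sampling parts. The function name and both parameter names are hypothetical, and the notation may differ from that used in Chapter 2.

```python
import math


def zip_zero_breakdown(pi_structural: float, lam: float) -> dict:
    """Split P(Y = 0) under a zero-inflated Poisson model into the part
    due to structural zeros and the part due to sampling zeros.

    pi_structural: assumed probability of being a genuine non-participant.
    lam: assumed Poisson rate for everyone else.
    """
    sampling = (1 - pi_structural) * math.exp(-lam)  # Poisson draw equal to 0
    p_zero = pi_structural + sampling
    return {
        "p_zero": p_zero,
        "share_structural": pi_structural / p_zero,
        "share_sampling": sampling / p_zero,
    }


# Hypothetical values chosen only for illustration.
result = zip_zero_breakdown(pi_structural=0.3, lam=2.0)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The share_structural entry is the probability that an observed zero belongs to the non-participant group, which is the quantity a zero-inflated model relies on when it is used to classify zeros into latent classes.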
Different from the Zero-inflated count data model, the Hurdle model proposed by
Mullahy (1986) assumes that all the zeros are structural zeros. When applied to consumption
analysis, Hurdle models assume that individuals need to pass two stages before being observed
with a positive consumption intensity: a participation decision and a consumption decision.
Furthermore, Hurdle models assume that the participation stage is dominant. Thus, all zero observations
are assumed to be generated in the first stage (the decision on whether to consume), and in the
second stage, the consumption behavior is truncated at zero. As a result, all zeros are treated as
structural zeros, whether they are in fact structural or sampling zeros. Hurdle Negative Binomial models
were developed later (Gurmu, 1998) to better handle the issue of
over-dispersion.
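
The two-stage structure described above can be sketched as a probability mass function. The snippet assumes a hurdle Poisson formulation in which a zero occurs with probability p_zero in the first stage and positive counts follow a zero-truncated Poisson distribution with rate lam in the second stage. The function and parameter names are hypothetical and are not taken from the model specification in Chapter 2.

```python
import math


def hurdle_poisson_pmf(k: int, p_zero: float, lam: float) -> float:
    """P(Y = k) under an assumed hurdle Poisson model.

    Stage 1: Y = 0 with probability p_zero (the hurdle is not crossed).
    Stage 2: given that the hurdle is crossed, Y follows a Poisson(lam)
             distribution truncated at zero, so only k >= 1 is possible.
    """
    if k == 0:
        return p_zero
    poisson_k = math.exp(-lam) * lam**k / math.factorial(k)
    # Rescale by P(Poisson > 0) so the positive counts sum to 1 - p_zero.
    return (1 - p_zero) * poisson_k / (1 - math.exp(-lam))


# Hypothetical values: half of all observations are zeros.
probabilities = [hurdle_poisson_pmf(k, p_zero=0.5, lam=2.0) for k in range(6)]
print([round(p, 4) for p in probabilities])
```

Because the second stage cannot produce a zero, every zero is attributed to the first stage, which is the sense in which the Hurdle model treats all zeros as structural.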
The choice between Hurdle models and Zero-inflated models should be based on whether
the researcher believes that all zero observations come from the structural zero group or that at
least some of the zeros are sampling (random) zeros. In consumption analysis, the
choice between Hurdle models and Zero-inflated models should mainly depend on whether or
not the researchers believe that potential consumers (those who would consume under the
right circumstances) exist in the market. However, if there is no clear distinction between the
two groups of zeros in the market, then the choice between these two models would depend on
model performance, which includes both model fit and predictive capabilities.
There is relatively little literature comparing and evaluating the performance of these
count data models, and results from the existing research are conflicting with respect to which
model is superior. For example, Green (1994) found that the negative binomial model was
superior to the ZIP model, and the ZIP model was superior to the Poisson model. Conversely,
Lambert (1992) argued that the ZIP model has superior fit compared to the Negative Binomial
model. Needlon et al. (2010) found that the ZIP model fits better than the Poisson and Hurdle
models, while Welsh et al. (1996) found the Hurdle and ZIP models to be equal. According to Miller
(2007), the discrepant results of these model comparisons may arise because the datasets
used in the analyses differ considerably in the proportion of zeros, with some studies using data
with as few as 20% zeros and others using datasets with as many as 90% zeros.
Additionally, few studies have compared the different count data models
under zero-inflation and over-dispersion using simulated data. Lambert (1992) proposed the
zero-inflated Poisson model and evaluated its performance using simulation studies. Miller (2007)
compared the Poisson, hurdle, and zero-inflated models under varying zero-inflation levels; and
Desjardins (2013) evaluated the performance of the zero-inflated negative binomial and negative
binomial hurdle models under simulation. However, most of these previous studies mainly focus
on model fit and parameter recovery, and rarely consider the models’ capacity to predict
the different zero types.
Focusing on the comparison of Zero-inflated and Hurdle models, previous research offers even
fewer comparisons. A critical assumption of Zero-inflated models is that both structural and
sampling zeros exist, yet no simulation studies have analyzed whether the Zero-inflated
models can effectively distinguish structural zeros from sampling zeros. If
the Zero-inflated models cannot differentiate the two types of zeros, then the advantage of using the
Zero-inflated models over Hurdle models is limited. Previously, only Desjardins (2013)
compared the Zero-inflated Negative Binomial model with the Hurdle Negative Binomial model
given a fixed zero proportion. However, no previous studies have examined the
Zero-inflated models versus Hurdle models with both Poisson and Negative Binomial formulations
given different levels of zero proportion and over-dispersion. Therefore, there is no
guidance from prior research on how different levels of zero proportion and
over-dispersion affect the Zero-inflated models and Hurdle models in terms of model fit and the
capability to predict different types of zero observations. For the analysis of consumption behavior, how the
models forecast different types of consumers (i.e., different reasons for zeros) is very
important for predicting and making recommendations related to market segmentation. It
becomes even more critical when choosing between Hurdle models, which assume no sampling
zeros, and Zero-inflated models, which allow both structural and sampling zeros.
To my knowledge, there have been no simulation studies that compare Zero-inflated
models and Hurdle models with both Poisson and Negative Binomial formulations,
and no prior studies that investigate how zero proportion and over-dispersion affect model
fit in these models. Furthermore, given the importance of differentiating different types of
zeros, particular attention should be given to comparing the models’ capabilities to
predict the correct latent classes, which has not yet been explored by previous research. In the
case of consumption analysis, we will compare the models’ capabilities to predict market
segmentation.
Additionally, in the analysis of consumption behavior, all count data models listed above
have been employed to analyze consumers’ consumption intention and intensity. Unfortunately,
none of these models can represent consumers’ actual consumption desire in the model
specification. Although the Hurdle models allow consumption behavior to be divided into two
stages, and the Zero-inflated models assume that zero observations can come from either
non-consumers or potential consumers, neither separates the two types of zero
observations in the model specification, so they cannot identify the two types by
the consumers’ true consumption desire. Moreover, both the Zero-inflated
and Hurdle models are built on a two-stage structure. Although they assume that the factors
influencing non-consumers differ from those influencing others, they still impose the strong restriction that
the same set of factors influences potential consumers and consumers with positive consumption
intensity, which may not always be true.
To overcome the shortcomings mentioned above, it is important to develop a new count
data model that can test for and treat the two types of zero observations separately in the
analysis of consumption behavior, and that allows three different sets of factors to influence the three
different groups: non-participants, potential consumers, and consumers. Thus, a Triple Hurdle
count data model is developed, which allows for the identification of non-consumers (in the first
hurdle), potential consumers (in the second hurdle), and consumption intensity (in the third
stage). This new model has better capability in market segmentation analysis, and it also allows a
deeper insight into the characteristics of different types of consumers. In order to generalize the
findings of this research, the new model will be compared with the double hurdle approach by
applying both models to the analysis of fresh blueberry consumption.
The structure of the dissertation is as follows. Chapter 2 reviews and compares the
performance of the six most popular methods of modeling count data, especially when the data
exhibit zero-inflation and over-dispersion. This includes the basic Poisson model and a discussion
of several remedies to it (the negative binomial model, zero-inflated Poisson model, zero-inflated
negative binomial model, hurdle Poisson model and hurdle negative binomial model). For each
model, we discuss its characteristics and applications in the field of consumption behavior. This
chapter also evaluates the six models using simulation studies under different levels of
dispersion and zero proportions, and compares the models' performance regarding model fit,
predictive capability, and particularly the capability to predict zero observations.
In Chapter 3, a new approach, a Triple Hurdle Count Data model, is proposed to analyze
consumption behavior, which assumes a three-stage decision making process. By differentiating
three different types of consumers in the model specification, this model allows three different
generating processes correlated with the three different types of consumers. This model will be
applied to a consumer choice problem of blueberry consumption together with the Double
Hurdle approach, to discuss the difference in insights gained by employing this new model.
Finally, Chapter 4 concludes with a discussion of the results and their implications for
consumption behavior analysis when the data are counts exhibiting zero-inflation and
over-dispersion. A discussion of the appropriate statistical methods for understanding
market structure and segmentation is also provided.
CHAPTER 2
COMPARISON OF THE PERFORMANCE OF COUNT DATA MODELS UNDER
DIFFERENT ZERO-INFLATION SCENARIOS USING SIMULATION STUDIES
Background
In statistical modeling, when the dependent variable is count data, the most popular
regression technique is the Poisson regression model. However, the Poisson model fails to
provide an adequate fit when the data are zero-inflated, so it has been modified to address this
issue. The most popular modifications are the Zero-inflated/Modified Poisson and Hurdle
Poisson models. Further, there are Negative Binomial variations of these models that account for
possible over-dispersion.
The Zero-inflated Poisson (ZIP) model was proposed by Lambert in 1992. Following
this, a number of related models have been proposed, including the Poisson-Negative Binomial
and modified Poisson, to address the inequality of the mean and variance (equality is assumed
for the Poisson distribution) (Famoye and Singh, 2006). The Zero-inflated count data model
assumes that the zero observations come from two distinct sources, identified separately as
“sampling zeros” and “structural zeros.” An example of the different types of zeros can be seen
when analyzing consumption of food products, where zero consumption could be recorded
when the consumer is a genuine non-participant (structural zero), or when zero consumption is
the corner solution of a standard consumer demand problem (sampling zero).
Different from the Zero-inflated count data model, the Hurdle models proposed by
Mullahy (1986) assume that all zeros are sampling zeros. When applied to consumption analysis,
Hurdle models assume individuals need to pass two stages before being observed with a positive
level of consumption: a participation decision and a consumption decision. Furthermore, the
Hurdle models assume participation dominance, which indicates that all zero observations are
assumed generated in the first stage (whether or not to consume), and in the second stage,
consumption behavior is truncated at zero.
Thus, the choice between Hurdle models and Zero-inflated models is typically based on
whether the researcher believes that all the zero observations come from a single source or
that they arise from two distinct sources (structural zeros and sampling zeros).
Relatively little literature has compared or evaluated the performance of these count data
models, and the studies that have examined this subject vary in their conclusions. For example,
Green (1994) found that the Negative Binomial model was superior to the ZIP model, and the
ZIP model was superior to the Poisson model, in terms of model fit. However, Lambert (1992)
argued that the ZIP model was superior to the Negative Binomial model regarding prediction
error. Needlon et al. (2010) found that the ZIP model fits better than the Poisson and Hurdle
models, while Welsh et al. (1996) found the Hurdle and ZIP models to be equal.
According to Miller (2007), the discrepant results of the model comparisons might arise
because the datasets employed differ considerably in the proportion of zeros, with some
research using data with 20% zeros and some using data with as much as 90% zeros. In addition
to differences in the proportion of zeros, the data sets also differ with regard to over-dispersion.
Desjardins (2013) evaluates the performance of the Zero-inflated Negative Binomial (ZINB) and
Negative Binomial Hurdle (NBH) models given different levels of dispersion.
Most of the comparisons are based on empirical datasets, and very few studies have
used simulated data to test and compare model performance. Lambert (1992) proposed the
Zero-inflated Poisson model and evaluated its performance using simulation studies. Miller
(2007) compares the Poisson, Hurdle, and Zero-inflated models under varying zero-inflation
levels, and Desjardins (2013) evaluates the performance of the Zero-inflated Negative Binomial
(ZINB) and Negative Binomial Hurdle (NBH) models using simulation under a given zero
proportion.
However, most of these previous studies that compare the models’ performance mainly
focus on model fit and parameter recovery. Although this is important, how well the models
predict different categories of zeros can also be of importance, especially for consumption
studies (to predict different types of consumers). When choosing between the Hurdle models,
which assume no structural zeros, and Zero-inflated models, which allow both structural and
sampling zeros, knowing their capability to predict market segmentation will be of interest.
To the best of our knowledge, no simulation studies have compared Zero-inflated and
Hurdle models under both Poisson and Negative Binomial distributions. There are also no prior
studies investigating how zero proportions and levels of dispersion might affect estimation and
model fit in the Hurdle and Zero-inflated models. Furthermore, special attention can be given to
comparing the models’ capabilities to predict the correct latent classes. In the case of
consumption analysis, we will compare the models’ capabilities of predicting the market
structure and segmentation. The Zero-inflated models assume that there are two different types
of zeros, and the Hurdle models assume that there is only one type of zero. If the Zero-inflated
models cannot differentiate the different types of zeros, then their usefulness will be much more
limited. Thus, it is critical to test whether the Zero-inflated models effectively predict the
different types of zeros, especially given various zero proportions.
In this study, two research questions will be examined. First, under different simulation
conditions, which include the true model distribution, zero proportion, and dispersion rate,
how will the six count data models (Poisson model, Zero-inflated Poisson model, Hurdle
Poisson model, and their Negative Binomial variations) perform regarding model fit? Second,
we will explore the consequences of misspecifying the distributions. In particular, we will
compare the Zero-inflated models and Hurdle models, evaluate the proportion of correctly
identified structural zeros for the Zero-inflated models, and test the consequences, if any, of
misspecifying the latent classes for the zeros given different levels of structural zeros.
Literature Review
Count Data and Generalized Linear Model
Count data occur very frequently in many fields of research, especially in the social
sciences. Count data represent the number of times that an event occurs under a certain
condition or during a certain time; for example, the number of times that consumers purchase a
certain good during a certain period would be an event count. As such, the response values take
the form of discrete nonnegative integers. Hence, count data are the “…realization of a
nonnegative integer-valued random variable” (Cameron and Trivedi, 1998).
When analyzing count data, it is assumed that the number of events is independently and
identically distributed with a discrete probability distribution. The most common probability
distributions used to describe count data are the Poisson and Negative Binomial distributions.
The Poisson distribution was derived as a limiting case of the binomial distribution and has the
characteristic of mean-variance equality. The Negative Binomial distribution was derived by
Greenwood and Yule (1920) and is used as an alternative to the Poisson distribution when the
assumption of mean-variance equality is violated.
As for regression models, the classic linear regression model is not suitable for count data
analysis, since the assumption of normality is violated. Thus, generalized linear models, which
allow the analysis of data when the assumptions of linearity and normality are no longer met, are
employed.
The Generalized Linear Model (GLM) was first described by Nelder and Wedderburn
(1972) and has been further developed and explained by McCullagh and Nelder (1989). Instead
of modeling the mean as a linear function of the covariates, as in classic linear regression, it
allows other possibilities. All GLMs are specified with three components: a random component,
which specifies the distribution of the output variable; a systematic component, which specifies
the covariates in a linear form; and a link function, which connects the random component to the
systematic component. If the distribution of the output variable is normal, then the classic OLS
regression is appropriate. Besides the normal distribution, other distributions, such as the
binomial, Poisson, and Negative Binomial distributions, can be used.
Regarding the three components of the GLM, it is useful to state the equations explicitly.
The systematic component is a linear form of the covariates, as in Equation 2-1:
η = 𝑥′𝛽 (2-1)
where 𝑥 is the vector of covariates for a given observation, and 𝛽 represents the corresponding
unknown parameters.
A link function connects the mean value μ of the output variable Y to the linear predictor η
through a function g(·). Thus, the GLM is expressed as Equation 2-2:
g(μ) = η = 𝑥′𝛽 (2-2)
Poisson Regression Model and Applications
The Poisson regression model is the most popular method for analyzing count data. It is a
specific form of the GLM in which the output variable Y is assumed to follow a Poisson
distribution, with the link function 𝑔(𝜇) = log(𝜇).
Thus, the probability of observing y𝑖 can be written as Equation 2-3:
\[ f(y_i; \mu_i) = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}, \qquad y_i = 0, 1, 2, 3, \ldots \tag{2-3} \]
where μ𝑖 is the parameter of the Poisson distribution, which is both the mean and the variance of
y𝑖 for the ith observation. Given the link function and the linear predictor, g(μ𝑖) = log(μ𝑖) =
x𝑖′𝛽, we have μ𝑖 = exp(x𝑖′𝛽).
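The log link and Equation 2-3 can be illustrated with a minimal Python sketch. The covariate vector and coefficients below are hypothetical values chosen only for illustration; they are not estimates from this study.

```python
import math

def poisson_mean(x, beta):
    """Inverse log link: mu = exp(x'beta)."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return math.exp(eta)

def poisson_pmf(y, mu):
    """Poisson mass mu^y * exp(-mu) / y!, computed on the log scale."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

# Hypothetical covariates (intercept plus one regressor) and coefficients.
x_i = [1.0, 2.0]
beta = [0.5, 0.25]
mu_i = poisson_mean(x_i, beta)                      # exp(0.5 + 0.5)
probs = [poisson_pmf(y, mu_i) for y in range(60)]   # mass on 0..59
```

Summing `y * probs[y]` and the corresponding squared deviations recovers a mean and a variance that both equal `mu_i`, which is the mean-variance equality built into the Poisson model.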
Poisson regression has been widely used to analyze count data. In the field of
consumption behavior, because consumption frequency or purchase intensity is often
recorded as count data, Poisson regression models have been used to analyze consumers’
behavior. For example, Morland et al. (2002) used Poisson regression to analyze
consumers’ access to healthy food choices in relation to the distribution of food stores and food
service places. Binkley (2006) employed Poisson regression to explore the effect of
demographic, economic, and nutrition factors on the consumption frequency of food away from
home. Cannuscio et al. (2013) used Poisson regression to analyze the correlation between
the food environment and residents’ shopping behaviors.
Problems with the Poisson Regression
Although Poisson regression models are popular when analyzing count data, the model
might not always be the best fit because the Poisson regression requires that the mean equal
the variance, that is, μ𝑖 = 𝐸(Y𝑖) = 𝑉𝑎𝑟(Y𝑖). The assumption of mean-variance equality is very
restrictive and easily violated. When the observed variance is greater (or less) than the observed
mean, the Poisson distribution no longer describes the data adequately, and the data are said to
exhibit over-dispersion (under-dispersion). Taking consumption behavior as an example, for
some daily goods like tobacco, many people might never consume tobacco because they are
non-smokers, yet many others might consume extremely large quantities per week (heavy
smokers). In this case, the data might not meet the assumption of mean-variance equality, and
the Poisson model would not be appropriate.
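A quick descriptive check of mean-variance equality is the ratio of the sample variance to the sample mean. The sketch below is only an illustration: the weekly tobacco counts are invented, and the ratio is a rough diagnostic rather than a formal test.

```python
def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 under a Poisson model."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean

# Hypothetical weekly tobacco purchases: many non-smokers, two heavy smokers.
weekly_packs = [0, 0, 0, 0, 0, 0, 1, 2, 14, 21]
index = dispersion_index(weekly_packs)   # well above 1: over-dispersion
```

A value far above 1 points to over-dispersion, and a value below 1 points to under-dispersion.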
A special case of over-dispersion happens when there are excessive zeros in the data.
When there are abundant zero observations, the mean of the data will be closer to zero, resulting
in the violation of mean-variance equality assumption. Thus, ignoring the issue of excessive
zeros will cause biased parameter estimates and poor model fit. Using tobacco consumption as
an example again, whenever the question of consumption frequency is asked, there would be
many people who answer zero, since they are non-smokers. The same thing happens for the
consumption of food, where consumers might choose not to consume in a given period, or not
consume for reasons such as allergies or personal beliefs.
When considering the source of the excessive zeros, some research argues that the zeros
might arise from different generating processes, which is a result of unexplained population
heterogeneity (Hu et al., 2011). Generally, it is considered that the zeros can be differentiated
into two types: structural zeros, which are generated from a latent class where zero is the only
possible value, and sampling zeros, which arise from a latent class where zero happens within a
random sample of potential count responses. In the case of consumption behavior, in response to
the question “How often did you consume tobacco last month?” there will be individuals who
have never smoked before (structural zeros) and individuals who are potential consumers but did
not choose to consume in the last month (sampling zeros). The structural zero observations are
the consumers who have a non-positive desire to consume (who can be categorized as
non-participants), and the sampling zero observations are those consumers who have a positive
desire but no positive acquisition in the given period (who can be categorized as potential
consumers).
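The two sources of zeros can be made concrete with a small data-generating sketch. It assumes, purely for illustration, that a share `p_structural` of individuals are non-participants and that everyone else draws a Poisson count; the function names, the seed, and the parameter values are hypothetical and are not the simulation design of this study.

```python
import math
import random

def draw_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def simulate(n, p_structural, lam, seed=0):
    """Return (label, count) pairs with both structural and sampling zeros."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        if rng.random() < p_structural:
            records.append(("non-participant", 0))      # structural zero
        else:
            y = draw_poisson(rng, lam)
            label = "potential consumer" if y == 0 else "consumer"
            records.append((label, y))                  # y == 0: sampling zero
    return records

records = simulate(20000, 0.3, 2.0)
```

In observed data only the counts are available, so the two zero-generating labels are latent; this is why a model's ability to recover them is of interest.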
To deal with over-dispersion and excessive zeros, different models have been used.
Generally, when over-dispersion is the only issue, the Negative Binomial model will be a better
fit than Poisson regression. If only zero-inflation exists, either the Zero-inflated Poisson model
or the Hurdle Poisson model is used. If both exist, then the Zero-inflated Negative Binomial and
Hurdle Negative Binomial models can be used.
Negative Binomial Regression Models and Applications
When the data has the issue of over-dispersion, the Negative Binomial model is usually
considered as an alternative to the Poisson regression model, since it provides an extra parameter
to accommodate the additional variability. The Negative Binomial distribution (NB) is a gamma
mixture of the Poisson distribution. In other words, a random non-negative integer is considered
distributed as the Poisson distribution with a mean of λ, where λ is a random variable with a
gamma distribution. Thus, the NB works to allow more flexibility in accommodating variability.
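For example, for a gamma distribution with shape parameter γ and scale parameter
θ = ρ/(1 − ρ), the mass function of the Negative Binomial distribution arising from the
gamma-Poisson mixture is written as Equation 2-4:
\[
\begin{aligned}
f(y, \gamma, \rho) &= \int_0^\infty f_{poi}(\lambda)\, f_{gamma(\gamma,\rho)}(\lambda)\, d\lambda \\
&= \int_0^\infty \frac{\lambda^{y} e^{-\lambda}}{y!} \cdot \frac{\lambda^{\gamma-1} e^{-\lambda(1-\rho)/\rho}}{\left(\frac{\rho}{1-\rho}\right)^{\gamma} \Gamma(\gamma)}\, d\lambda \\
&= \frac{\Gamma(\gamma+y)}{y!\,\Gamma(\gamma)}\, \rho^{y} (1-\rho)^{\gamma}
\end{aligned}
\tag{2-4}
\]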
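The closed form in Equation 2-4 can be checked numerically. The sketch below integrates the Poisson mass against the gamma density with a simple midpoint rule and compares the result with the closed form; the integration limit, step count, and parameter values are illustrative assumptions.

```python
import math

def gamma_pdf(lam, shape, scale):
    """Gamma density with the given shape and scale, on the log scale."""
    return math.exp((shape - 1) * math.log(lam) - lam / scale
                    - math.lgamma(shape) - shape * math.log(scale))

def poisson_pmf(y, lam):
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def mixture_pmf(y, shape, rho, upper=80.0, steps=80000):
    """Midpoint-rule approximation of the gamma-Poisson mixture integral."""
    scale = rho / (1 - rho)
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        total += poisson_pmf(y, lam) * gamma_pdf(lam, shape, scale)
    return total * h

def nb_closed_form(y, shape, rho):
    """Last line of Equation 2-4."""
    return math.exp(math.lgamma(shape + y) - math.lgamma(y + 1)
                    - math.lgamma(shape)
                    + y * math.log(rho) + shape * math.log(1 - rho))
```

For example, with shape 2 and ρ = 0.6, `mixture_pmf(1, 2.0, 0.6)` and `nb_closed_form(1, 2.0, 0.6)` agree to within the integration error.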
The standard formulation for the Negative Binomial mass function of a variable Y is
given as Equation 2-5:
\[ f(y; k, \mu) = \frac{\Gamma(y+k)}{\Gamma(y+1)\,\Gamma(k)} \left(\frac{k}{k+\mu}\right)^{k} \left(1 - \frac{k}{k+\mu}\right)^{y} \tag{2-5} \]
where E(Y) = μ, Var(Y) = μ + μ²/k, 1/k is defined as the dispersion parameter, and k is the
gamma shape parameter (γ in Equation 2-4). As k increases to infinity, Var(Y) decreases to μ,
which is equal to E(Y), and the Negative Binomial distribution approaches the Poisson distribution.
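The moments and the Poisson limit stated for Equation 2-5 can be verified numerically. The following is a minimal sketch; the values k = 2 and μ = 3 are arbitrary illustrations.

```python
import math

def nb_pmf(y, k, mu):
    """Negative Binomial mass function of Equation 2-5, on the log scale."""
    q = k / (k + mu)
    log_p = (math.lgamma(y + k) - math.lgamma(y + 1) - math.lgamma(k)
             + k * math.log(q) + y * math.log(1 - q))
    return math.exp(log_p)

k, mu = 2.0, 3.0
probs = [nb_pmf(y, k, mu) for y in range(400)]
mean = sum(y * q for y, q in enumerate(probs))
var = sum((y - mean) ** 2 * q for y, q in enumerate(probs))
# Expected from the text: mean = mu, var = mu + mu**2 / k.

# With a very large k the mass approaches the Poisson mass with the same mean.
poisson_at_2 = math.exp(2 * math.log(mu) - mu - math.lgamma(3))
nb_at_2_large_k = nb_pmf(2, 1e6, mu)
```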
For the Negative Binomial regression model, which is also a specific form of the GLM,
the link function is the same log transformation used in the Poisson regression model, g(μ𝑖) =
log(μ𝑖) = x𝑖′𝛽. Furthermore, as mentioned above, the Negative Binomial distribution converges
to the Poisson distribution as k increases to infinity; thus, the Poisson regression model is nested
within the Negative Binomial regression model. As a result, the Likelihood Ratio test or Wald
test can be used to test whether the dispersion parameter is significant.
Since the Negative Binomial regression model is more flexible than the Poisson regression
model in accommodating data with more variability, it has also been widely used in the
analysis of consumption behavior. For example, Lesser et al. (2013) employed the Negative
Binomial model to test the association between outdoor food advertising and obesity; the
authors rejected the Poisson model because of over-dispersion in the data.
Han and Powell (2013) also employed the Negative Binomial model to analyze consumption
patterns of sugar-sweetened beverages in the United States.
However, although the Negative Binomial regression model can accommodate data
with over-dispersion, it still has some limitations, especially when dealing with
zero-inflation. Previous research indicates that the Negative Binomial regression model
does not fit zero-inflated data well (Desjardins, 2012; Hu et al., 2011;
Lambert, 1992). Additionally, when different latent classes generate two types of zeros,
the Negative Binomial model cannot capture the different characteristics of the two groups.
In the case of consumption, Negative Binomial models are very restrictive in assuming
that the same set of factors influences both consumers’ decisions on participation and
consumption. Furthermore, both Poisson and Negative Binomial regression models assume that
the characteristics of consumers and non-consumers do not differ significantly, and thus fail to
fully identify different consumer types. To investigate the different types of zeros (and
consumers), a mixture model or a two-part model may improve fit.
Zero-inflated Models and Applications
Zero-inflated models are mixture models that combine a point mass at zero with a count
distribution, and they are able to accommodate excessive zeros in count data.
Zero-inflated models assume that there are different latent classes in the population. Thus, the
zero observations could be generated through two different sources: “sampling” and “structural”
zeros. When applying the Zero-inflated model to food consumption, zero consumption will be
recorded when the consumer is a genuine non-participant (structural zero), or when the
consumer is a potential consumer who chooses zero consumption as the corner solution of a
standard consumer demand problem (sampling zero). Thus, using the Zero-inflated models
allows us to predict the existence of three different groups: genuine non-participants,
potential consumers, and active consumers with positive consumption.
Zero-inflated models have been developed for different count distributions, including Poisson
regression models (Lambert, 1992), Negative Binomial regression models (Ridout, Hinde, and
Demetrio, 2001), and other models (e.g., geometric models (Mullahy, 1986)).
The Zero-inflated Poisson (ZIP) model was proposed by Lambert in 1992. It assumes a
mixture of two distributions at the point of zero: a Poisson distribution and a binomial
distribution. According to this assumption, it is assumed that with probability p, the only possible
observation is 0 (structural zero), and with probability (1-p), a Poisson random variable is
observed. The probability mass function of a ZIP model is as follows in Equation 2-6:
\[ \Pr(Y = y) = \begin{cases} p + (1-p)\exp(-\lambda), & y = 0 \\ (1-p)\,\dfrac{\lambda^{y} e^{-\lambda}}{y!}, & y > 0 \end{cases} \tag{2-6} \]
Thus, from Equation 2-6, zero observations can arise from two parts: the structural point mass
component, p, and the sampling Poisson component, (1 − p) exp(−λ). In the ZIP model,
E(Y) = μ = (1 − p)λ and Var(Y) = μ + (p/(1 − p)) μ².
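Because Equation 2-6 splits the zero probability into a structural part and a sampling part, Bayes' rule gives the probability that an observed zero is structural. This posterior formula is not stated in the text above; it is a standard consequence of the mixture and is sketched here with illustrative values of p and λ.

```python
import math

def zip_zero_prob(p, lam):
    """Total probability of a zero under Equation 2-6."""
    return p + (1 - p) * math.exp(-lam)

def prob_structural_given_zero(p, lam):
    """Share of the zero mass contributed by the structural component."""
    return p / zip_zero_prob(p, lam)

# Illustrative values: 30% structural zeros and a Poisson mean of 2.
posterior = prob_structural_given_zero(0.3, 2.0)
```

The posterior grows with λ: when active consumers rarely record a zero, an observed zero is more likely to belong to a non-participant.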
The ZIP model is also a special case of GLM, with a logit link function for p and a log link
function for λ, as follows (Equations 2-7 and 2-8):
\[ \operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right) = x'\beta \tag{2-7} \]
\[ \log(\lambda) = z'\alpha \tag{2-8} \]
Where 𝑥 are the covariates for the first stage, with 𝛽 as the corresponding estimates; z are
the covariates for the second stage, with 𝛼 as the corresponding estimate. Furthermore, there is
no requirement that x=z.
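Equations 2-6 to 2-8 can be combined in a short sketch that maps covariates to p and λ and then evaluates the ZIP mass function. The covariate and coefficient values are hypothetical and serve only to show that x and z may differ.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def zip_pmf(y, p, lam):
    """Zero-inflated Poisson mass function of Equation 2-6."""
    pois = math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))
    return p + (1 - p) * pois if y == 0 else (1 - p) * pois

# Hypothetical covariates and coefficients; x and z need not coincide.
x, beta = [1.0, 0.5], [-1.0, 0.4]
z, alpha = [1.0, 2.0], [0.2, 0.3]
p = inv_logit(dot(x, beta))      # Equation 2-7
lam = math.exp(dot(z, alpha))    # Equation 2-8

probs = [zip_pmf(y, p, lam) for y in range(100)]
mean = sum(y * q for y, q in enumerate(probs))
var = sum((y - mean) ** 2 * q for y, q in enumerate(probs))
```

The computed `mean` and `var` match (1 − p)λ and μ + (p/(1 − p))μ², the moments given for the ZIP model.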
The Zero-inflated Poisson model has been widely used when dealing with excessive
zeros, and there are many examples in the analysis of consumption as well. For example, Almasi
et al. (2016) employed the ZIP model to analyze the effects of nutritional habits on dental care
among schoolchildren, and Matheson et al. (2012) explored the influence of gender on alcohol
consumption using the ZIP model.
Additionally, Lambert (1992) extends the ZIP model to the ZIP(τ) model, which allows p
and λ to be related through a shape parameter τ. Huang and Chin (2010) employed the ZIP(τ) to
model road traffic crashes, and Calsyn et al. (2009) used the ZIP(τ) to explore the correlation
between motivational and skills training and HIV risk.
Just as the Poisson regression model was extended to the Negative Binomial regression
model, the Zero-inflated Poisson regression model can be extended to the Zero-inflated Negative
Binomial (ZINB) regression model. Apart from zero-inflation, over-dispersion can also arise
from greater variability in the non-zero outcomes. In this case, the ZINB model is a better fit for
the data than the ZIP model.
Similar to the ZIP distribution, the ZINB distribution assumes a mixture at the point of
zero: a Negative Binomial distribution and a binomial distribution.
Thus, the ZINB can be expressed as follows in Equation 2-9:
\[ \Pr(Y = y) = \begin{cases} p + (1-p)\left(\dfrac{k}{k+\mu}\right)^{k}, & y = 0 \\ (1-p)\,\dfrac{\Gamma(y+k)}{\Gamma(y+1)\,\Gamma(k)} \left(\dfrac{k}{k+\mu}\right)^{k} \left(1 - \dfrac{k}{k+\mu}\right)^{y}, & y > 0 \end{cases} \tag{2-9} \]
where μ is the mean of the NB distribution, and 1/k is the dispersion parameter. Thus, the
mean and variance of the ZINB distribution are E(Y) = (1 − p)μ and
Var(Y) = (1 − p) μ (1 + μ/k + pμ). Just as the NB distribution converges to the Poisson
distribution as k increases to infinity, the ZINB distribution also converges to the ZIP
distribution as k increases.
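The ZINB mass function in Equation 2-9 and its stated moments can be checked numerically. This is a minimal sketch with illustrative parameter values (p = 0.25, k = 2, μ = 3).

```python
import math

def nb_pmf(y, k, mu):
    q = k / (k + mu)
    return math.exp(math.lgamma(y + k) - math.lgamma(y + 1) - math.lgamma(k)
                    + k * math.log(q) + y * math.log(1 - q))

def zinb_pmf(y, p, k, mu):
    """Zero-inflated Negative Binomial mass function of Equation 2-9."""
    base = (1 - p) * nb_pmf(y, k, mu)
    return p + base if y == 0 else base

p, k, mu = 0.25, 2.0, 3.0
probs = [zinb_pmf(y, p, k, mu) for y in range(500)]
mean = sum(y * q for y, q in enumerate(probs))
var = sum((y - mean) ** 2 * q for y, q in enumerate(probs))
# Expected from the text: mean = (1 - p) * mu,
# var = (1 - p) * mu * (1 + mu / k + p * mu).
```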
Much research has employed the ZINB model. Examples of its use in consumption
analysis include Hendrix and Haggard (2015), who employed the ZINB model to analyze global
food prices and regime type in the developing world, and Moralies and Higuchi (2017), who
employed the ZINB model to explore the effect of consumers’ beliefs about health and
nutrition on their willingness to pay for fish.
Hurdle Model and Applications
The Hurdle model was first developed by Cragg (1971) as an example of truncated
models, relaxing the Tobit model by allowing separate stochastic processes for the observed zero
and positive outcomes (Yen and Huang, 1996). Different from the Zero-inflated models, the
Hurdle models are no longer a mixture of different models, but a two-part model. The first part
predicts whether the outcome is zero or not, and the second part generates the non-zero counts.
Thus, it assumes that all the zeros are from the first stage.
When modeling consumption behavior using the Hurdle count data model, there is an
assumption that individuals need to pass two stages before being observed with a positive level
of consumption: a participation decision and a consumption decision. In the first stage, the
consumer makes a decision on whether or not to participate. In the second stage, a decision on
how much/many to purchase is determined. Specifically, the Hurdle model assumes that the
participation stage dominates the consumption stage. Thus, if the consumers choose to
participate in the first step, it does not allow zero-consumption in the consumption stage.
The Hurdle model uses a binomial logistic regression model to indicate whether a count
is zero or positive (Green, 1994). If a positive outcome is realized, then a zero-truncated count
data model (Poisson/NB) is used for the positive counts. However, the first part does not have to
be the binomial logistic regression model: “there will likely exist numerous plausible
specifications of both the binary probability model and the conditional distribution of the
positives” (Mullahy, 1986). For example, Mullahy (1986) used the Poisson
distribution to govern the probability of observing a zero count. Thus, a generic Hurdle model is
as follows in Equation 2-10:
\[ \Pr(Y = y) = \begin{cases} g_1(0), & y = 0 \\ \left(1 - g_1(0)\right) \dfrac{g_2(y)}{1 - g_2(0)}, & y = 1, 2, 3, \ldots \end{cases} \tag{2-10} \]
where Y is the outcome variable; 𝑔1 is the specification of the binary probability model
that governs the first Hurdle, indicating whether the outcome is zero; and 𝑔2 is the specification
of the truncated-at-zero probability distribution generating the positive values.
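Equation 2-10 translates directly into a function that accepts any zero probability for the first hurdle and any count distribution for the second part. In this sketch the second part is taken to be Poisson and the numeric values are hypothetical; both choices are for illustration only.

```python
import math

def hurdle_pmf(y, g1_zero, g2):
    """Equation 2-10: g1_zero is g1(0); g2 is the untruncated count pmf."""
    if y == 0:
        return g1_zero
    return (1.0 - g1_zero) * g2(y) / (1.0 - g2(0))

def poisson_pmf(y, lam):
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

# Hypothetical values: 60% zeros from the first hurdle, Poisson(2.5) positives.
g2 = lambda y: poisson_pmf(y, 2.5)
probs = [hurdle_pmf(y, 0.6, g2) for y in range(80)]
```

Dividing by `1 - g2(0)` renormalizes the count distribution over the positive integers, so the zero mass is governed by the first hurdle alone.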
There are also some popular specifications for 𝑔1 and 𝑔2. For example, Green (1994)
specified 𝑔1 as a binomial distribution and 𝑔2 as a truncated-at-zero Poisson distribution,
which gives the following form in Equation 2-11:
\[ \Pr(Y = y) = \begin{cases} p, & y = 0 \\ (1-p)\,\dfrac{\lambda^{y} e^{-\lambda}}{\left(1 - e^{-\lambda}\right) y!}, & y = 1, 2, 3, \ldots \end{cases} \tag{2-11} \]
where p is the probability of a count being observed as zero, and λ is the parameter of
the truncated Poisson distribution. More specifically, the link function for p is the logit
transformation, logit(p) = log(p/(1 − p)) = 𝑥′𝛽, and the link function for λ is the log, with
log(λ) = z′𝛼. Here 𝑥 are the covariates for the first stage, with 𝛽 as the corresponding
estimates; z are the covariates for the second stage, with 𝛼 as the corresponding estimates.
Furthermore, there is no requirement that x = z.
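In another example, Mullahy (1986) specified both 𝑔1 and 𝑔2 as Poisson distributions,
which provides the following specification as Equation 2-12:
\[ \Pr(Y = y) = \begin{cases} e^{-\lambda_1}, & y = 0 \\ \left(1 - e^{-\lambda_1}\right) \dfrac{\lambda_2^{y} e^{-\lambda_2}}{\left(1 - e^{-\lambda_2}\right) y!}, & y = 1, 2, 3, \ldots \end{cases} \tag{2-12} \]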
Where λ1 is the parameter for the Poisson distribution governing the first Hurdle; λ2 is
the parameter for the Poisson distribution generating the positives. Both λ1 and λ2 could be
parameterized with log link function as Log(λ1) = 𝑥′𝛽, and Log(λ2) = z′𝛼.
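A property of Equation 2-12 worth seeing in code is that the hurdle collapses to an ordinary Poisson model when the two parameters coincide, because the truncation factor cancels. The property follows from the equation itself rather than being stated in the text; the parameter values below are arbitrary.

```python
import math

def poisson_pmf(y, lam):
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def poisson_hurdle_pmf(y, lam1, lam2):
    """Equation 2-12: Poisson first hurdle and zero-truncated Poisson positives."""
    if y == 0:
        return math.exp(-lam1)
    return ((1 - math.exp(-lam1)) * poisson_pmf(y, lam2)
            / (1 - math.exp(-lam2)))

# With lam1 == lam2 the hurdle model reproduces the plain Poisson mass.
same = [poisson_hurdle_pmf(y, 1.7, 1.7) for y in range(15)]
plain = [poisson_pmf(y, 1.7) for y in range(15)]

# With lam1 != lam2 the zero mass is set independently of the positive counts.
probs = [poisson_hurdle_pmf(y, 0.4, 3.0) for y in range(80)]
```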
Shonkwiler and Shaw (1996) extended Mullahy’s specification by allowing zero
observations in both the first and second stages. Thus, in Shonkwiler and Shaw’s model (the
Double Hurdle count-data model¹), there are two mechanisms generating zero observations: a
zero can arise either in the first stage, by choosing not to consume, or in the second stage, by
choosing a consumption frequency of zero. The essence of the Double Hurdle count data
model is very similar to the ZIP model, but with the first part indicating the structural zero using
a Poisson distribution specification instead of a binomial. The specification for the Double
Hurdle count data model is as follows in Equation 2-13:
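\[
\begin{aligned}
\Pr(Y = y) &= \begin{cases} e^{-\lambda_1} + \left(1 - e^{-\lambda_1}\right) e^{-\lambda_2}, & y = 0 \\ \left(1 - e^{-\lambda_1}\right)\left(1 - e^{-\lambda_2}\right) \dfrac{\lambda_2^{y} e^{-\lambda_2}}{\left(1 - e^{-\lambda_2}\right) y!}, & y = 1, 2, 3, \ldots \end{cases} \\
&= \begin{cases} e^{-\lambda_1} + \left(1 - e^{-\lambda_1}\right) e^{-\lambda_2}, & y = 0 \\ \left(1 - e^{-\lambda_1}\right) \dfrac{\lambda_2^{y} e^{-\lambda_2}}{y!}, & y = 1, 2, 3, \ldots \end{cases}
\end{aligned}
\tag{2-13}
\]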
Where λ1 is the parameter for the Poisson distribution governing the first part, indicating
whether the zeros are structural zeros or not; λ2 is the parameter for the Poisson distribution for
the second part. Both λ1 and λ2 could be parameterized with log link function as Log(λ1) = 𝑥′𝛽
and Log(λ2) = z′𝛼. If we let p = 𝑒−λ1, then this model specification is the same as the ZIP
model.
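The equivalence between Equation 2-13 and the ZIP model under p = exp(−λ1) can be confirmed term by term. The sketch below uses arbitrary illustrative values for λ1 and λ2.

```python
import math

def poisson_pmf(y, lam):
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def double_hurdle_pmf(y, lam1, lam2):
    """Second (simplified) form of Equation 2-13."""
    stay_out = math.exp(-lam1)          # probability of a structural zero
    if y == 0:
        return stay_out + (1 - stay_out) * math.exp(-lam2)
    return (1 - stay_out) * poisson_pmf(y, lam2)

def zip_pmf(y, p, lam):
    """Equation 2-6."""
    if y == 0:
        return p + (1 - p) * math.exp(-lam)
    return (1 - p) * poisson_pmf(y, lam)

lam1, lam2 = 0.8, 2.2
p = math.exp(-lam1)
gaps = [abs(double_hurdle_pmf(y, lam1, lam2) - zip_pmf(y, p, lam2))
        for y in range(30)]
```

Every entry of `gaps` is zero up to floating-point error, so the two specifications differ only in how the structural-zero probability is parameterized.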
¹ The term is borrowed from Shonkwiler and Shaw (1996).
Just as the Poisson regression model can be extended to the NB regression model, and the
ZIP model to the ZINB model, the Hurdle Poisson regression model can be extended to
the Hurdle NB model. There have been many studies using Hurdle models (Hu et al.,
2001; Bandyopadhyay et al., 2011; Bethell et al., 2010; Rose et al., 2006). Examples include
Crowley, Eakins and Jordan (2012), who employed the double-Hurdle model to analyze lottery
participation and expenditure; Jaunky and Ramchurn (2014), who analyzed consumer behavior
in the scratch card market; Jiang et al. (2012), who modeled mushroom consumption; and Bezu
and Kassie (2014), who estimated maize planting decisions.
Comparison of the Models
In this section, six count data models have been described: the Poisson, NB, ZIP,
ZINB, Hurdle Poisson (PH), and Hurdle NB (NBH) models. When one model is nested within
another, a Wald or likelihood ratio (LR) test can be used to test the significance of the extra
parameters in the larger model. For example, the Poisson regression model is nested within the
NB, the ZIP within the ZINB, and the PH within the NBH. Other pairs of models
are not inherently nested within each other. Models that are not nested can be compared using
Vuong’s test (Vuong, 1989), the Akaike Information Criterion (AIC), and the Bayesian
Information Criterion (BIC).
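The comparison tools named above reduce to simple functions of the maximized log-likelihoods. The sketch below uses made-up log-likelihood values and a one-parameter LR test, as in Poisson against NB. The chi-square p-value with one degree of freedom is computed with the identity P(χ² > s) = erfc(√(s/2)). Because the Poisson model corresponds to a boundary value of the dispersion parameter, the reference distribution is sometimes adjusted, which this sketch does not attempt.

```python
import math

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * loglik

def lr_test_1df(loglik_restricted, loglik_full):
    """LR statistic and chi-square(1) p-value for one extra parameter."""
    stat = 2 * (loglik_full - loglik_restricted)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2))
    return stat, p_value

# Hypothetical maximized log-likelihoods for a Poisson and an NB fit.
ll_poisson, ll_nb = -100.0, -98.07925
stat, p_value = lr_test_1df(ll_poisson, ll_nb)
aic_poisson, aic_nb = aic(ll_poisson, 2), aic(ll_nb, 3)
```

Lower AIC or BIC values indicate the preferred model; unlike the LR test, these criteria do not require the models to be nested.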
Prior research has compared some of these models, such as the NB and Poisson regression
models, showing, as mentioned above, that the NB better handles
over-dispersion (Atkins and Gallops, 2007; Warton, 2005). When dispersion is not present,
according to Warton (2005), Poisson regression models perform better than NB
models.
As for the comparisons between NB and Zero-inflated models, Lambert (1992) compared
the ZIP model to the Poisson and NB model when the ZIP model was first proposed. The
36
conclusion was that the ZIP outperformed the other two models, and the NB performs better than
the Poisson model in term of the prediction. Green (1994) compared the NB, ZIP, and ZINB
models. Based on Vuong’s test statistics, he found that the ZINB model performs the best,
followed by the NB, ZIP and Poisson models. A possible reason for this result may be because
the ZINB model could accommodate two sources of dispersion, and in the data used in this
study, the dispersion was caused mostly by unobserved response heterogeneity. This would lead
the NB model to perform better than the ZIP model. Desouhant et al. (1998) compared the NB
and ZIP models and found that the two models perform roughly similar. They conclude that
researchers need to accommodate both over-dispersion and zero-inflation in the analysis of count
data. Slymen et al. (2006) compared the ZIP, ZINB, NB, and Poisson models and found the NB
model fit better than the Poisson models. However, the ZINB and ZIP models performed nearly
the same both regarding model fit and parameter estimates, which indicates that the main issue of
the data in this study was zero-inflation, and dispersion was likely not severe in this case.
Wenger and Freeman (2008) compared the ZIP, ZINB, NB, and Poisson models and concluded that
the Zero-inflated models perform better than the non-inflated models and that models with the
NB formulation fit better than those without it.
The comparison between Zero-inflated models and Hurdle models has attracted more
attention. The focus of the comparison has been the sources of zero observations. As discussed
above, the Hurdle models assume that only one type of zero observation exists, whereas the
Zero-inflated models assume that zero observations come from two different sources. A second
difference between these two models is their capability to handle data with zero-deflation.
Zero-inflated models are typically used to analyze data with zero-inflation and fit poorly when
there are fewer zero counts than the count distribution implies, while Hurdle models fit
zero-deflated data better. Min and Agresti (2005) compared the ZIP model with the Hurdle Poisson
model in a simulation study and found that the ZIP model estimated very poorly when the data
were zero-deflated, while the Hurdle model did not. This study suggests that the Hurdle models
might be more general than the Zero-inflated models. Desjardins (2013) compared the ZINB and
HNB models using simulations and found that the HNB performs better than the ZINB regarding
both model fit and parameter recovery.
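The point about zero-deflation can be made concrete with a small numerical sketch. It assumes the standard ZIP zero probability, p + (1 − p)·exp(−θ), and assumes that the Hurdle parameter π denotes the probability of a zero. The function names and numbers are hypothetical illustrations and are not results from the studies cited above.

```python
import math


def zip_zero_prob(p, theta):
    """P(Y = 0) under a ZIP model with structural-zero probability p."""
    return p + (1.0 - p) * math.exp(-theta)


def hurdle_zero_prob(pi):
    """P(Y = 0) under a Hurdle model (assumed here: pi is the zero probability)."""
    return pi


theta = 2.0
poisson_zero = math.exp(-theta)        # zero probability of a plain Poisson
zip_floor = zip_zero_prob(0.0, theta)  # smallest zero probability a ZIP allows

# A zero-deflated target: fewer zeros than the Poisson itself would produce.
deflated_target = 0.05
zip_can_match = deflated_target >= zip_floor
hurdle_can_match = 0.0 <= hurdle_zero_prob(deflated_target) <= 1.0

print(f"Poisson P(0) = {poisson_zero:.3f}, ZIP floor = {zip_floor:.3f}")
print(f"ZIP can match {deflated_target}: {zip_can_match}")
print(f"Hurdle can match {deflated_target}: {hurdle_can_match}")
```

Because p cannot be negative, the ZIP zero probability can never fall below exp(−θ), whereas the Hurdle zero probability is a free parameter. This is one way to see why the Hurdle formulation is described as the more general of the two.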
Gaps and Shortcomings
Although there has been much research comparing models, few studies have compared and
evaluated model performance using simulation. Beyond the studies by Lambert (1992), Min and
Agresti (2005), Miller (2007) and Desjardins (2013), no one has compared the performance of the
count data models using simulated rather than empirical data. Furthermore, no one has compared
all six count data models in a simulation study under different levels of zero proportion and
over-dispersion. In addition, when comparing models, previous studies focus mainly on model fit,
without considering predictive capability. In particular, when comparing the Hurdle models with
Zero-inflated models, their capability to capture zero observations has rarely been considered.
Thus, how the different count data models perform given different levels of zero
proportion and dispersion is an area that needs further investigation. From the previous empirical
research, it can be inferred that if the effect of dispersion is much larger than the effect of
zero-inflation, the NB model should perform better than the ZIP model. The question remains, though:
how does the effect of zero-inflation change given different levels of dispersion? How would the
model performance change based on different levels of zero proportion given different levels of
dispersion?
Furthermore, when analyzing consumption behavior, it is of great interest to analyze
different consumer types and explore market segmentation. In this sense, in addition to model
fit, it is also very important to compare the models’ prediction capabilities. To my
knowledge, there has been no prior research focusing on the comparison of models’ prediction
capabilities, especially when comparing Zero-inflated models to Hurdle models, which have very
different approaches to classifying zeros. An interesting point would be to determine whether the
Zero-inflated model can accurately predict the portion of non-consumers (structural zeros),
which is the most important use of the Zero-inflated models. Going one step further, if we allow
different levels of (structural) zeros in the Zero-inflated models, how would the performance of
the two models change?
Method
Research Questions
Based on the literature review, two research questions will be examined in this study.
First, under different simulation conditions, including the zero proportion and the dispersion
rate, how do the six count data models (the Poisson model, Zero-inflated Poisson model, Hurdle
Poisson model, and their Negative Binomial variations) perform regarding model fit? Second, what
are the consequences of misspecifying the distribution? In particular, we will pay special
attention to the comparison between Zero-inflated models and Hurdle models, evaluate the
proportion of correctly identified structural zeros for the Zero-inflated models, and test the
consequences, if any, of misspecifying the latent classes for the zeros given different levels of
structural zeros.
To answer these research questions, a simulation study was conducted under different
scenarios. We will generate datasets from four different distributions: the Zero-inflated Poisson
distribution (𝜌, 𝜇), where 𝜌 is the proportion of structural zeros and 𝜇 is the mean of the
Poisson; the Hurdle Poisson distribution (𝜋, γ); the Zero-inflated Negative Binomial distribution
(𝜌, 𝜇, k), where 𝜌 is the proportion of structural zeros, 𝜇 is the mean of the Negative Binomial
count process, and k is the dispersion rate; and the Hurdle Negative Binomial distribution
(𝜋, γ, k). After generation, we will fit each dataset with each of the six count data models to
compare their performance under correct and incorrect model specifications.
The simulation conditions controlled in this experiment are the (structural) zero
proportion and the dispersion rate. To be more specific, the zero/structural zero proportion (𝜌)
and the dispersion rate (k) will each be set at several levels, and each model’s capability to
capture the zero observations and structural zero observations will be compared across these
levels. To evaluate model performance, the model fit, prediction bias, and proportion of
correctly identified structural and sampling zeros will be recorded and compared. In addition,
the consequences of fitting a model to a misspecified distribution will receive special attention.
Monte Carlo Simulation
As discussed in the previous section, the generalized linear model is constructed from a
systematic component, a random component and a link function. The case model for the Poisson
regression assumes that $y_1, y_2, \ldots, y_n$ are independently distributed as follows:
$$Y_i \sim \text{Poisson}(\theta_i)$$
where the link function is:
$$\log(\theta_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} \qquad \text{(2-14)}$$
Similarly, the negative binomial formulation of the Poisson model is the same as the
Poisson regression but with an extra dispersion parameter.
The case model for the zero-inflated Poisson model assumes that $y_1, y_2, \ldots, y_n$ are
independently distributed as follows:
$$Y_i \sim \text{ZIP}(p_i, \theta_i)$$
where the link functions are:
$$\text{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = \alpha_0 + \alpha_1 z_{1i} + \alpha_2 z_{2i} \qquad \text{(2-15)}$$
$$\log(\theta_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} \qquad \text{(2-16)}$$
The zero-inflated negative binomial model is similar to the ZIP model, with an extra
dispersion parameter: $Y_i \sim \text{ZINB}(p_i, \theta_i, k^{-1})$. The link functions of the ZINB model
are the same as those of the ZIP model.
The last set of models is the Hurdle models. The Hurdle Poisson regression model
assumes that $y_1, y_2, \ldots, y_n$ are independently distributed as follows:
$$Y_i \sim \text{HP}(\pi_i, \theta_i)$$
where the link functions are:
$$\text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \alpha_0 + \alpha_1 z_{1i} + \alpha_2 z_{2i} \qquad \text{(2-17)}$$
$$\log(\theta_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} \qquad \text{(2-18)}$$
Similarly, the Hurdle Negative Binomial model has the same link functions as the HP
model, but has one additional parameter for dispersion.
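To show how the link functions above map covariates to model parameters, the following minimal sketch inverts the logit and log links for a single observation. The coefficient and covariate values are hypothetical and are not the values used in this simulation study.

```python
import math


def inverse_logit(eta):
    """Invert the logit link: returns a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def count_mean(beta, x1, x2):
    """Invert the log link: theta_i = exp(b0 + b1*x1 + b2*x2)."""
    b0, b1, b2 = beta
    return math.exp(b0 + b1 * x1 + b2 * x2)


def zero_part_prob(alpha, z1, z2):
    """Probability from the logit equation: a0 + a1*z1 + a2*z2."""
    a0, a1, a2 = alpha
    return inverse_logit(a0 + a1 * z1 + a2 * z2)


# Hypothetical coefficients and covariates for one observation.
beta = (0.5, 0.3, -0.2)
alpha = (-1.0, 0.4, 0.0)
theta_i = count_mean(beta, x1=1.0, x2=2.0)      # exp(0.4)
p_i = zero_part_prob(alpha, z1=1.0, z2=0.0)     # inverse_logit(-0.6)
print(f"theta_i = {theta_i:.4f}, p_i = {p_i:.4f}")
```

The log link keeps the count mean positive and the logit link keeps the zero-part probability between 0 and 1, whatever the values of the linear predictors.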
Simulation Study Design
The simulation study was designed to examine the performance of the six count data
models under different sets of simulation conditions, and to identify the conditions under which
these models have similar or dissimilar performance. In particular, in this experiment, we will
evaluate the performance of the models regarding model fit and capability to predict zero
observations. In addition, the experiment is designed to explore the consequences of fitting a
wrong model to a pre-specified distribution; specifically, it will give attention to the
consequences of fitting a Zero-inflated model to a Hurdle-model distribution and vice versa.
To be more specific, in this experiment, datasets will be generated from the following
four distributions: Zero-inflated Poisson, Zero-inflated Negative Binomial, Hurdle Poisson, and
Hurdle Negative Binomial. For each distribution, the zeros/structural zeros are generated
from a binomial process controlled by different levels of p, and the counting process
(Poisson/Negative Binomial) is set given known coefficients. In particular, if the parameter
p (proportion of structural zeros) in ZIP is 0, the distribution reduces to a Poisson
distribution (a truncated Poisson distribution in the Hurdle case); similarly, if the parameter p
in ZINB is 0, the distribution reduces to a Negative Binomial distribution (a truncated Negative
Binomial distribution in the Hurdle case). In addition, for the Negative Binomial formulations,
the experiment will also control different levels of dispersion to compare model performance
under different situations. Once a dataset is generated, six different count data models will
be fit to it, and we will compare model performance based on model fit and the capability to
capture zeros and structural zeros. We will also evaluate the capability of coefficient recovery
(for the counting process), and relative/absolute bias.
Regarding simulation conditions, the following scenarios will be considered in this study:
distributions, model types, levels of dispersion, and varying levels of zero/structural zero
proportion. In total, there were 4 distributions (ZIP, ZINB, HP and HNB), 6 models (Poisson,
NB, ZIP, ZINB, HP, HNB), 4 levels of dispersion (0.5, 1, 2, 4) and 5 levels of (structural) zero
proportion (0.1, 0.3, 0.5, 0.7, 0.9). Here, the level of zero proportion indicates the total zero
proportion for Hurdle model distributions and the structural-zero proportion for Zero-inflated
model distributions. In addition, the dispersion level applies only to the Negative Binomial
formulations. In total, this results in 4*5*(3+3*4)=300 different scenarios.
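One possible reading of the count of 300 is that each data condition (distribution, zero proportion and, for the Negative Binomial distributions only, dispersion level) is crossed with the six fitted models. The enumeration below is a sketch of that reading; the decomposition is an assumption made for illustration, and it happens to give the same total as the expression in the text.

```python
from itertools import product

# True for distributions that carry a dispersion level (assumed reading).
distributions = {"ZIP": False, "HP": False, "ZINB": True, "HNB": True}
models = ["Poisson", "NB", "ZIP", "ZINB", "HP", "HNB"]
dispersion_levels = [0.5, 1, 2, 4]
zero_levels = [0.1, 0.3, 0.5, 0.7, 0.9]

# Data conditions: Poisson-based distributions have no dispersion level.
data_conditions = []
for dist, has_dispersion in distributions.items():
    ks = dispersion_levels if has_dispersion else [None]
    for p, k in product(zero_levels, ks):
        data_conditions.append((dist, p, k))

# Every fitted model is applied to every data condition.
scenarios = [(cond, model) for cond in data_conditions for model in models]

print(len(data_conditions), "data conditions")
print(len(scenarios), "scenarios in total")
```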
Sample size is another important concern when analyzing different models. When the
sample size is too small, results may not be consistent, since asymptotic normality cannot be
assumed; when the sample size is too large, the computation time increases significantly.
Lambert (1992) considered sample sizes of 25, 50, and 100, but singularities and non-convergence
occurred in the experiment. In particular, when the sample size is small (such as 25 or 50),
near-perfect discrimination (when a hyperplane separates all the 0s on one side from all the 1s
on the other) is more likely to occur. Thus, to avoid these issues, the sample size in this
experiment was set to 250 in all cases.
The simulation size is also important for the validity of the simulation results. If
there are too few replications, results may not be consistent; as the number of replications
increases, the consistency of the results also increases. In this study, the simulation size is
set to 1,000, similar to Civettini and Hines’s (2005) research, which analyzed model
misspecification with the ZINB model.
The “glm” procedure in R was used for the Poisson and Negative Binomial regression
analysis; the “pscl” library in R (Zeileis et al., 2008) was used for the Zero-inflated Poisson,
Zero-inflated Negative Binomial, and Hurdle Poisson models; and the “actuar” library in R
(Dutang et al., 2008) was used for the Hurdle Negative Binomial model. For each model fitted to
each dataset, results including the log-likelihood and AIC statistics, the relative bias of
E(Y|X), and the predictions of structural zeros/zeros are saved for further analysis.
Model Evaluation of the Simulation Studies
To assess the performance of the six different models under different scenarios, we will
employ various measures related to model fit and their capabilities of predicting zero
observations. Specifically, for model fit, we will employ the Loglikelihood statistics, AIC
statistics, and the relative bias for the E (Y|X).
The AIC statistic is defined as 2k − 2log(L), where k is the number of parameters in the
model (not the dispersion rate) and L is the likelihood of the model. In general, models with the
lowest AIC are favored. Because the six count models are not all nested within one another, the
AIC statistic is used for comparison in this analysis.
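The AIC definition can be written as a short function. The sketch below also includes the BIC for comparison, using its standard definition k·ln(n) − 2log(L). The model names, log-likelihoods and parameter counts are hypothetical and are not taken from the tables in this chapter.

```python
import math


def aic(loglik, n_params):
    """Akaike Information Criterion: 2k - 2 log(L)."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 log(L)."""
    return n_params * math.log(n_obs) - 2 * loglik


# Hypothetical fits: (log-likelihood, number of parameters).
# The larger model fits slightly better but pays for one extra parameter.
fits = {"ZIP": (-400.0, 6), "ZINB": (-399.5, 7)}
aic_values = {name: aic(ll, k) for name, (ll, k) in fits.items()}
best = min(aic_values, key=aic_values.get)

print(aic_values)
print("Lowest AIC:", best)
```

In this hypothetical case the larger model has the higher log-likelihood, yet the smaller model has the lower AIC, because the gain in fit does not offset the penalty for the additional parameter.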
Besides the log-likelihood and AIC statistics, similar to what Lambert (1992) did in her
study, we will also employ the relative bias of E(Y|X) to examine the model fit. The relative
bias of E(Y|X) is defined as the absolute value of [E(Ŷ|X) − E(Y|X)]/E(Y|X). For each
simulation, the average of the relative bias was calculated and then aggregated across all
simulations. In particular, to evaluate the prediction more accurately, 10-fold
cross-validation was employed.
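The relative bias measure defined above can be sketched as follows. The function names and the numbers are hypothetical. The sketch averages the absolute relative bias within one replication and then averages across replications, which is one plausible reading of "aggregated".

```python
def mean_relative_bias(predicted, true_mean):
    """Average of |E(Y_hat|X) - E(Y|X)| / E(Y|X) over the observations."""
    if len(predicted) != len(true_mean):
        raise ValueError("predicted and true_mean must have the same length")
    ratios = [abs(p - t) / t for p, t in zip(predicted, true_mean)]
    return sum(ratios) / len(ratios)


def aggregate(per_replication):
    """Average the per-replication values across all replications."""
    return sum(per_replication) / len(per_replication)


# Hypothetical predicted and true conditional means for two replications.
rep1 = mean_relative_bias([1.1, 2.2, 2.7], [1.0, 2.0, 3.0])
rep2 = mean_relative_bias([1.3, 2.0, 3.0], [1.0, 2.0, 3.0])
overall = aggregate([rep1, rep2])
print(f"rep1 = {rep1:.3f}, rep2 = {rep2:.3f}, overall = {overall:.3f}")
```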
As for the capability to predict zero observations, the predictions of zeros and
structural zeros will be saved in each scenario. The Zero-inflated models are able to
predict both structural and sampling zero observations, so results will be recorded for their
predictions of both types of zeros. By design, the Hurdle models are able to predict only one
type of zero, so results will be recorded only for their predictions of zero observations. In
addition, to better assess the prediction of zeros, 10-fold cross-validation was employed in
this analysis.
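A minimal sketch of how observations could be split for 10-fold cross-validation is given below. The helper name and the seed are hypothetical, and the model fitting inside each fold is omitted. With n=250, each fold holds out 25 observations and trains on the remaining 225.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


splits = list(k_fold_indices(250, k=10))
for train, test in splits[:2]:
    print(len(train), "training rows,", len(test), "held-out rows")
```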
Since we generate datasets from four different distributions and, for each distribution,
fit all six count data models, we give special attention to how the models perform when a wrong
model is used for a given true distribution. In particular, we are interested in exploring the
capability of Hurdle models to capture zeros when the true model is a Zero-inflated model, and
vice versa.
Data Generating Mechanism
Data for ZIP/ZINB were generated using a procedure similar to the one employed in
Lambert (1992), but the proportion of structural zeros p was controlled in this experiment. The
data-generating mechanism is as follows:
1. Calculate $\theta_i$ based on the specified coefficients of the count process.
2. Generate a Uniform(0,1) random vector U of length n.
3. If $U_i \le p$ (the given level of structural zeros), then $y_i = 0$; otherwise $y_i \sim \text{Poisson}(\theta_i)$ or
$y_i \sim \text{NB}(\theta_i, k)$, where k is the given level of dispersion.
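The three steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions and not the code used in this study: the Negative Binomial is drawn as a gamma-Poisson mixture with variance θ + kθ² (the parameterization of k is an assumption), the Poisson sampler is Knuth's multiplication method, and the coefficients used to build θ are hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """Knuth's method; adequate for the moderate means used here."""
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1


def nb_draw(rng, mean, k):
    """Gamma-Poisson mixture; assumed variance: mean + k * mean**2."""
    lam = rng.gammavariate(1.0 / k, k * mean)  # shape 1/k, scale k*mean
    return poisson_draw(rng, lam)


def generate_zero_inflated(theta, p, k=None, seed=0):
    """Step 2 and 3: structural zero with probability p, else a count draw."""
    rng = random.Random(seed)
    y = []
    for theta_i in theta:
        if rng.random() <= p:
            y.append(0)                      # structural zero
        elif k is None:
            y.append(poisson_draw(rng, theta_i))
        else:
            y.append(nb_draw(rng, theta_i, k))
    return y


# Step 1 with hypothetical coefficients and a binary covariate, n = 250.
theta = [math.exp(0.5 + 0.3 * (i % 2)) for i in range(250)]
y = generate_zero_inflated(theta, p=0.3, seed=42)
print("observed zeros:", y.count(0), "of", len(y))
```

Note that the observed zeros include both structural zeros and sampling zeros from the count process, so their share exceeds p.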
The data-generating mechanism for the Hurdle Poisson/Hurdle NB distribution is as
follows:
1. Calculate $\theta_i$ based on the specified coefficients of the count process.
2. Generate a Uniform(0,1) random vector U of length n.
3. If $U_i \le p$ (the given zero proportion), then $y_i = 0$; otherwise $y_i \sim$ Zero-Truncated
Poisson($\theta_i$) or $y_i \sim$ Zero-Truncated NB($\theta_i$, k), where k is the given level of dispersion.
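The Hurdle mechanism differs from the Zero-inflated one only in the count step, where the draw is zero-truncated. The sketch below is an illustration under assumptions and not the code used in this study: truncation is done by simple rejection (redrawing until the count is positive), and the Negative Binomial is again a gamma-Poisson mixture with assumed variance θ + kθ².

```python
import math
import random


def poisson_draw(rng, lam):
    """Knuth's method; adequate for the moderate means used here."""
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1


def zero_truncated(draw):
    """Rejection sampling: redraw until a positive count is obtained."""
    while True:
        value = draw()
        if value > 0:
            return value


def generate_hurdle(theta, p, k=None, seed=0):
    """Zero with probability p, else a zero-truncated count."""
    rng = random.Random(seed)
    y = []
    for theta_i in theta:
        if rng.random() <= p:
            y.append(0)
        elif k is None:
            y.append(zero_truncated(lambda: poisson_draw(rng, theta_i)))
        else:
            # Assumed NB: gamma mixing with shape 1/k and scale k*theta_i.
            y.append(zero_truncated(
                lambda: poisson_draw(rng, rng.gammavariate(1.0 / k, k * theta_i))))
    return y


y = generate_hurdle([2.0] * 250, p=0.3, seed=42)
print("observed zeros:", y.count(0), "of", len(y))
```

Because the count part can no longer produce zeros, every observed zero comes from the first step, so the zero share is close to p itself. This is the sense in which the Hurdle distribution has only one type of zero.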
Results
Pseudo-Population Zero-inflated Poisson Model
First, we focus on the results when the pseudo-population is the Zero-inflated Poisson model.
The convergence rates under the five different levels of structural zeros are displayed in
Table 2-1. As the zero proportion gets larger, the convergence rate gets smaller. When the
structural zero proportion is as high as 90%, the convergence rate is only 86.1%.
Model fit
The log-likelihood (LL) statistics for the six models at different levels of zero
proportion when the true model is ZIP are displayed in Table 2-2, and the corresponding AIC
statistics are displayed in Table 2-3. When the proportion of structural zeros is 10% or 30%,
the Zero-inflated Negative Binomial (ZINB) regression model, rather than the true ZIP model, has
the lowest LL. However, based on the AIC, the true ZIP model has the best fit, because the ZINB
model has one extra parameter (the dispersion parameter). When the zero proportion gets larger
(50%, 70% and 90%), the true ZIP model has the lowest LL. When the proportion of zeros is less
than 90%, the ZINB model is the best among the five alternative models. When the zero percentage
increases to 90%, the HP model is the best alternative model according to both the LL and AIC
statistics.
Another finding is that as the zero proportion increases, the model fit of the Zero-inflated
and Hurdle models improves, yet the model fit of the Poisson and Negative Binomial regressions
gets worse. This indicates that the Poisson and Negative Binomial models struggle to handle
zero-inflation.
Relative bias of E(Y|X)
The relative bias for E(Y|X) when the true distribution is the Zero-inflated Poisson
distribution is displayed in Table 2-4. Similar to the model fit statistics, we observe that when
the true model is ZIP, the ZINB model performs well in predicting Y. When the percentage of
structural zeros is smaller than 50%, the true ZIP model has the best performance in predicting
Y, yet when the percentage of structural zeros is greater than 50%, the misspecified ZINB
model is better than the true model. Across all the different levels of zero proportion, the
average relative bias of the ZIP model is -0.43, and that of the ZINB model is smaller in
magnitude, at -0.35. The Poisson and Negative Binomial (NB) models have significant bias in
predicting Y (relative value larger than 1). In addition, the Hurdle models are the worst models
regarding the relative bias in this case (-1.77).
When comparing the results row by row, as the proportion of structural zeros increases, the
relative bias of the true model ZIP increases, and the relative bias of the mis-specified ZINB
model also increases. When the proportion of structural zeros reaches 90%, the relative bias
of all six models is large, including that of the true ZIP model.
Abilities of capturing zero observation
Another important feature is whether or not the Zero-inflated models are capable of
predicting the zero observations and structural zero observations. The observed zeros in each
dataset, together with the predicted zero observations and structural zero observations from the
different models, are displayed for different proportions of structural zeros in Table 2-5. When
the proportion of structural zeros (p) is equal to 0.1, the mean number of observed zeros is
83. Among the six models, both the Zero-inflated and Hurdle models capture the zero observations
accurately. The Hurdle models predict a number of zero observations exactly equal to the observed
zeros, due to the models’ attributes. Both the ZIP and ZINB models predict 84 zero
observations, only one more than the observed number. At the same time, neither the Poisson nor
the Negative Binomial (NB) model captures enough zero observations. As p increases to 0.7, both
the Hurdle models and Zero-inflated models continue to predict the zero observations very
accurately.
When focusing only on the prediction of structural zeros, Hurdle models, which allow only
one type of zero, are unable to distinguish the different types of zeros. Zero-inflated
models allow zero observations to come from two separate processes, which allows the zeros to be
differentiated. Given n=250, the expected number of structural zeros is 250*p.
When p=0.1, 0.3, 0.5, 0.7 and 0.9, the expected numbers of structural zeros are 25, 75, 125, 175,
and 225, respectively. The ZIP and ZINB models provide the same results when p is 0.1, 0.3,
and 0.5. When p is 0.1, the ZIP and ZINB models overestimate the percentage of structural zeros,
and when p=0.3 and 0.5, they predict the structural zeros accurately. When p increases to 0.7
and 0.9, the true ZIP model still provides accurate predictions, while the ZINB model
underestimates the structural zero percentage.
From this experiment, in the case where the true model is the Zero-inflated Poisson
regression model, we find the Zero-inflated models have powerful capabilities to predict both
zero observations and structural zero observations. Because of the models’ design, the Hurdle
models can always estimate the zero observations accurately, yet they fail to capture the existing
structural zero observations. Thus, if the research has underlying assumptions regarding the
existence of structural zeros, the Zero-inflated models should be considered over Hurdle models.
When comparing the Zero-inflated models in more detail, results indicate that the ZIP
model predicts structural zeros very accurately when there is a significant portion of
structural zeros, while it may overestimate the percentage when the proportion of structural
zeros is comparatively small. The misspecified ZINB model tends to underestimate the structural
zeros when the percentage of structural zeros increases. Both the Poisson and Negative Binomial
regression models fail to capture abundant zero observations, and their model fit worsens as
zero observations increase. Thus, when faced with zero-inflation, the modified Poisson
regressions should be considered.
Pseudo-Population Hurdle Poisson Model
Next, we consider the case when the pseudo-population is the Hurdle Poisson model.
Convergence rates under the five different levels of zero proportion are shown in Table 2-6. As
the proportion of zeros increases (up to 70% zeros), the convergence rate decreases. When the
proportion of zeros reaches 70%, the convergence rate reaches a low of 71.1%.
Model fit
The log-likelihood statistics for the six different models with different proportions of zeros
are displayed in Table 2-7. The true Hurdle Poisson (HP) model has the lowest log-likelihood when
the proportion of zeros is 10% and 90%. At other proportions of zero, the Hurdle Negative
Binomial (HNB) model has the lowest log-likelihood rather than the true model.
However, when focusing on the AIC statistics (Table 2-8), the true HP model has the best
model fit, with the HNB model being the best among the remaining five misspecified models
when p is relatively small. When p increases to 90%, the ZIP model becomes the second-best
model in terms of model fit. In this scenario, the Poisson regression model has the worst
model fit across all five proportions of zeros. Additionally, the model fit of the true HP
model and the Zero-inflated models improves as the proportion of zeros increases.
Relative bias of E(Y|X)
The relative bias for E(Y|X) in the case when the true distribution is Hurdle Poisson is
shown in Table 2-9, with each row indicating a different scenario of zero-proportion.
Results of examining relative bias indicate that when the true model is HP, the HNB model
is the best alternative model in predicting Y when the proportion of zeros is small (less than
70%). However, when the proportion of zeros increases to 90%, neither the HP nor the HNB model
predicts well, and the relative bias rises above 100%. In this situation (only when the
proportion of zeros is 90%), the ZIP model has the lowest bias. Because the prediction of the HP
model is so poor when zeros make up 90% of the data, its overall average relative bias is
larger than that of the ZIP model. The results also indicate that both the Poisson and NB
models have a significant bias in predicting Y.
Abilities of capturing zero observation
In addition to model fit, we are concerned with the capability of the models to correctly
predict the zero observations. Because the datasets are generated from the Hurdle Poisson
model, which has a single process for generating zeros, we compare only the six models’
predictions of zero observations.
The observed and predicted zeros for the six different models are displayed in Table 2-10.
Both the HP and HNB models predict the zeros exactly. However, when the proportion of zero
observations is 10%, 30%, 50% and 70%, both the ZIP and ZINB models significantly overestimate
the zero observations. When the proportion of zeros is 90%, the ZIP and ZINB models predict the
zero observations very accurately, matching the observed number. Another finding is that as the
proportion of zeros increases, the difference between the ZIP and ZINB predictions and the true
number of zero observations decreases, which indicates that the bias of the Zero-inflated
models in predicting zero observations decreases as the proportion of zeros increases. There
were no differences between the ZIP and ZINB models in this case regarding their capability to
predict zeros.
Because the Hurdle models predict the zero observations accurately when the true model is
HP, while the Zero-inflated models (both ZIP and ZINB) overestimate the proportion of zeros
(especially when the proportion of zeros is small), using Zero-inflated models when the
researcher believes that the true distribution is Hurdle Poisson could cause significant
bias.
Pseudo-Population Zero-inflated Negative Binomial Model
Next, we analyze the results when the pseudo-population is the Zero-inflated Negative
Binomial (ZINB) model. Unlike the previous two scenarios, the Negative Binomial
distribution has one more condition to be controlled: the dispersion rate. Thus, in this
case, both the proportion of zeros and the level of dispersion are controlled and analyzed.
The convergence rates for the five levels of structural zeros and four levels of
dispersion are displayed in Table 2-11. As the proportion of structural zeros increases, the
convergence rate decreases. Additionally, as the dispersion rate increases, the convergence rate
improves, which indicates that models with larger variance (larger dispersion rate) converge
more easily.
Model fit
The log-likelihood and AIC statistics for the six models when the data follow a ZINB distribution
with a dispersion rate of 0.5 are shown in Table 2-12 and Table 2-13. In this case, for all five
proportions of structural zeros, the true ZINB regression model has the lowest log-likelihood,
with the Hurdle Negative Binomial model as the best alternative. When the true model has the
Negative Binomial formulation, both the ZIP and HP models have poor model fit, which
indicates that the Poisson distribution does not handle data with dispersion well. Thus a pre-test
for dispersion should be conducted before selecting models.
The log-likelihood and AIC statistics for the situations when the dispersion rate is equal
to 1, 2 and 4 are displayed in Table 2-14 to Table 2-19. Similar results were found in each
scenario: the true ZINB model has the best model fit in all cases, and the HNB model is the best
alternative among the misspecified models. In addition, in all of the different situations,
model fit improves when the percentage of structural zeros increases.
Examining the log-likelihood and AIC statistics for the various dispersion rates case
by case shows that as the dispersion rate increases, the model fit improves for both the ZINB and
HNB models. Thus, we find that the ZINB and HNB models handle data well when both zero-inflation
and dispersion are present. The average AIC of the ZINB model is displayed in Table 2-20,
showing that the model fit of the ZINB model improves as the proportion of zeros
increases, while the model fit decreases as the dispersion rate increases.
Relative bias of E(Y|X)
The relative bias for E(Y|X) in the case when the true distribution is the Zero-inflated
Negative Binomial distribution with dispersion rate=0.5 is shown in Table 2-21. Each row in the
table indicates a different scenario of zero-proportion.
When the true model is ZINB, the ZIP model is the best alternative model in predicting
Y, which differs from the results for the AIC statistics. When the percentage of structural
zeros is 10%, the true ZINB model has the best prediction, yet when the zero proportion increases
to 30%, 50% and 90%, the misspecified ZIP model has better prediction than the ZINB model.
As the proportion of zeros increases, the relative bias of the true ZINB model increases.
The results indicate that the Hurdle models have a significant bias in this case: the average
relative bias of both the HP and HNB models is substantially above 100%.
The relative bias when the true model is ZINB and the dispersion rate is 1, 2 and 4 is
shown in Table 2-22 to Table 2-24. Similar to when the dispersion rate is 0.5, the ZINB model
has the best prediction of Y when the percentage of zeros is relatively small, while when the zero
proportion increases to 90%, the misspecified ZIP model has better prediction than the true
model. Using results from all dispersion levels, we find that the ZINB model tends to have
relatively better prediction as the dispersion rate increases. In addition, across all scenarios,
results show that the Hurdle models have poor prediction of Y when the true model is ZINB.
The average relative bias of the ZINB model at different combinations of zero proportion
and dispersion rate is shown in Table 2-25. As the dispersion rate increases, the prediction of
the true ZINB model improves, especially when the proportion of zeros is smaller (10%, 30%,
50% and 70%). However, when the proportion of zeros is 90%, the ZINB model has good
prediction when D=0.5, and as D increases, the prediction gets worse.
Abilities of capturing zero observation
The predicted zero observations and structural zero observations from each of the six models
given different proportions of structural zero observations and dispersion rates when the true
model is ZINB are shown in Tables 2-26 to 2-29. When the dispersion rate is 0.5 and the
percentage of structural zeros is 0.1, the observed zero observations have a mean of 120.
Comparing the six models, both the Zero-inflated and Hurdle models capture the zero
observations accurately. However, both the Poisson regression and the Negative Binomial
regression models underestimate zero observations. As the percentage of zeros increases to 0.9,
the HNB and ZINB models continue to predict zero observations accurately, while the ZIP
model overestimates zero observations when the percentage of zeros is 10%, 30% and 50%. This
may be related to the failure to account for the effects of dispersion.
In addition to predicting the number of zero observations, the capability to predict
structural zeros is also of interest. When p=0.1, 0.3, 0.5, 0.7, and 0.9, the expected number of
structural zeros is 25, 75, 125, 175, and 225, respectively (given n=250). When the dispersion
rate equals 0.5, the true ZINB model was likely to overestimate structural zeros when the
proportion of zeros is 10% and was more likely to underestimate the structural zeros when the
proportion of zeros increases.
When the dispersion rate equals 1, the true ZINB model has better prediction of structural
zeros than the previous case, yet it still underestimates structural zeros when the proportion of
zeros increases to 70% and 90%, and overestimates structural zeros when the proportion of zeros
is 10%. Again, the misspecified ZIP model will overestimate the proportion of structural zeros.
When the dispersion rate increases to 2 and 4, results are similar. The true ZINB model is
more likely to overestimate the structural zeros when the proportion of zeros is small and is more
likely to underestimate the structural zeros when the proportion of zeros increases, but the
prediction bias is relatively small. The misspecified ZIP model overestimates the proportion of
structural zeros significantly across all the different proportions of zeros.
When the true model is ZINB, both the Zero-inflated and Hurdle models predict zero
observations well, regardless of the dispersion rate. Regarding the prediction of structural
zeros, the true ZINB model tends to slightly overestimate structural zeros when the percentage
of zeros is small and to underestimate them when the percentage of structural zeros gets large.
The ZIP model, however, is likely to significantly overestimate the structural zeros, possibly
because it cannot capture the effect of dispersion and thus attributes all the dispersion to the
presence of many zeros. What is more, the prediction error gets smaller as the dispersion rate
gets larger. Thus, in applied research it is important to test the data for dispersion before
deciding which model to employ, since using the ZIP model when the data have large variance
will cause significant bias in predicting structural zeros.
Pseudo-Population Hurdle Negative Binomial Model
Finally, we analyze the results when the pseudo-population is the Hurdle Negative
Binomial (HNB) Model. The convergence rates for the five different proportions of structural
zeros and four levels of dispersion are displayed in Table 2-30. Different from the previous
situations, the model has the smallest convergence rate when the proportion of zeros is 70%, and
when the proportion of zeros increases to 90%, convergence improves. As the dispersion rate
increases, the convergence rate improves, whereas as the proportion of zeros increases up to
70%, the convergence rate gets worse.
Model fit
The Loglikelihood and AIC statistics when the data is Hurdle Negative Binomial with a
dispersion rate of 0.5 are shown in Tables 2-31 and 2-32. In this case, when the proportion of
zeros is 10%, 30%, 50% and 70%, the true HNB model has the best model fit. The Zero-inflated
Negative Binomial (ZINB) model is the best alternative among the misspecified models. When
the proportion of zeros increases to 90%, the misspecified ZINB model behaves even better than
the true model, which indicates that given a high percent of zeros (90%), the HNB model does
not handle the dataset as well.
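For reference, the AIC values reported in this chapter are linked to the log-likelihood through AIC = 2k - 2 log L, where k is the number of estimated parameters. The sketch below applies this to the 10% row of Table 2-2 (true model is ZIP). The parameter counts are an assumption: they are not stated in this section and were chosen because, after rounding, they are consistent with the 10% row of Table 2-3.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 log L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Mean log-likelihoods from the 10% row of Table 2-2 (true model is ZIP).
mean_loglik = {"Poisson": -1767.7, "NB": -683.1, "HP": -514.0,
               "ZIP": -506.6, "HNB": -513.9, "ZINB": -506.5}

# ASSUMPTION: parameter counts inferred from the reported AIC values
# (intercept + 2 slopes per model part, plus 1 dispersion parameter for NB).
n_params = {"Poisson": 3, "NB": 4, "HP": 6, "ZIP": 6, "HNB": 7, "ZINB": 7}

aic_by_model = {name: aic(ll, n_params[name]) for name, ll in mean_loglik.items()}
best_model = min(aic_by_model, key=aic_by_model.get)

for name, value in aic_by_model.items():
    print(f"{name:8s} AIC = {value:.1f}")
print("lowest AIC:", best_model)
```

Because the NB-type models carry one extra parameter, they can have a slightly higher log-likelihood than their Poisson counterparts and still a higher AIC, which is the pattern for ZIP versus ZINB in this row.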
When the true model has the Negative Binomial formulation, both the ZIP and HP
models have poor model fit, indicating, not surprisingly, the Poisson formulation has poor model
fit when the data has the issue of dispersion. What is more, when the proportion of zeros is 10%,
the Negative Binomial model behaves better than the Zero-inflated Poisson and Hurdle Poisson
models. As the proportion of zeros increases, the model fit of the NB model gets worse.
Similar results are found when the dispersion rate is 1, 2, and 4 in Table 2-33 to Table 2-
38. When the proportion of zeros is 10%, 30%, 50% and 70%, the true HNB model has the best
model fit. When the proportion of zeros increases to 90%, the misspecified ZINB model has
better model fit than the true model.
The AIC statistics for the true HNB model given different proportions of zeros and
dispersion rates are shown in Table 2-39. When focusing on the HNB model itself, we find that
as the dispersion rate and the proportion of zeros increase, the model fit improves.
Relative bias of E(Y|X)
The relative bias for E(Y|X) in the case when the true distribution is Hurdle Negative
Binomial at different dispersion rates are shown in Table 2-40 to Table 2-43.
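As a reading aid, the sketch below shows the standard model-implied means for the Poisson-based Zero-inflated and Hurdle models, together with one common definition of relative bias. The exact formula and sign convention used for the relative bias in this study are not restated in this section, so the `relative_bias` function is an assumption, and all function names are illustrative.

```python
import math


def zip_mean(pi, lam):
    """E(Y) under ZIP: structural zeros occur with probability pi."""
    return (1.0 - pi) * lam


def hurdle_poisson_mean(p_zero, lam):
    """E(Y) under Hurdle Poisson: P(Y = 0) = p_zero, and the positive part
    is a zero-truncated Poisson with mean lam / (1 - exp(-lam))."""
    return (1.0 - p_zero) * lam / (1.0 - math.exp(-lam))


def relative_bias(predicted_mean, true_mean):
    """ASSUMED definition: (predicted - true) / true."""
    return (predicted_mean - true_mean) / true_mean


if __name__ == "__main__":
    pi, lam = 0.3, 2.0                      # hypothetical parameter values
    true_mean = zip_mean(pi, lam)
    # A hurdle model that matches the ZIP zero probability and uses the
    # same lam implies the same mean.
    p_zero = pi + (1.0 - pi) * math.exp(-lam)
    print(true_mean, hurdle_poisson_mean(p_zero, lam))
    # A 10% overshoot of the fitted mean corresponds to a relative bias of 0.1.
    print(relative_bias(1.1 * true_mean, true_mean))
```

Under the hurdle formula, the implied mean depends on the estimated parameters of the truncated count part, so errors in that part carry directly into the predicted mean.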
When the true model is HNB with dispersion rate equal to 0.5, it is not the true HNB
model that has the best prediction of Y, but the misspecified HP when the proportion of zeros is
smaller (10%, 30% and 50%). When the zero proportion increases to 70% and 90%, it is the
misspecified ZIP model that has the smallest bias. When the proportion of zeros is
relatively small (from 10% to 50%), the true HNB model is the second best model, with relative
bias from 10% to 30%. However, when the proportion of zeros increases to 70% and 90%, the
true model HNB has very large bias. In addition, as the proportion of zeros gets larger, the
relative bias of the true HNB model also gets larger.
When the dispersion rate increases to 1, the true HNB model has the least relative bias
when the proportion of zeros is relatively small. As the proportion of zeros increases, the
misspecified HP model has the least relative bias until the proportion of zeros is 50%, and the
ZIP model has the least relative bias when the proportion of zeros increases to 70% and 90%.
Similar results were found when the dispersion rate is 2 and 4. When the proportion of
zeros is relatively small, the true HNB model has the least bias, and the HP model is the best
alternative among the remaining five misspecified models. When the proportion of zeros
increases to 50%, the HP model has the least bias rather than the true HNB model. As the
proportion of zeros increases to 90%, the ZIP model has the least relative bias, and the ZINB
model is the second best. Neither the true model HNB nor the HP model performs well regarding
relative bias when the proportion of zeros is large.
When focusing on the performance of the true HNB model alone (Table 2-44), the relative bias
increases as the proportion of zeros increases when the dispersion rate is 0.5 and 1. The
results also indicate that the HNB model has a significant bias in predicting E(Y) when the
proportion of zeros is large.
Abilities of capturing zero observation
Regarding the model’s ability to predict zero observations for the six models given the
true HNB model, the observed zeros in each dataset, together with the predicted zero
observations from the six different models are displayed in Tables 2-45 to 2-48 for different
proportions of zero observations, and under different dispersion rates. Both the Hurdle Poisson
and Hurdle Negative Binomial models capture the zeros accurately across all five proportions
of zeros. At the same time, both the Zero-inflated Poisson (ZIP) and Zero-inflated Negative
Binomial (ZINB) models are more likely to overestimate the zero observations, especially when
the proportion of zeros is small (10%, 30%, and 50%). As the proportion of zeros increases, the
Zero-inflated models’ predictions improve. When the dispersion rate is 0.5, the ZINB model
overestimates the proportion of zeros more than the ZIP model.
When the dispersion rate increases to 1 and 2, results are similar. The Hurdle models
capture zero observations accurately, and the Zero-inflated models tend to overestimate the zero
observations when the proportion of zero observations is relatively small (10% and 30%). When
the dispersion rate increases to 4, the Hurdle models again capture the zero observations very
accurately. However, different from the results from the previous situations, the Zero-inflated
models do not overestimate the zero proportions as much as previously. Thus, when zero
prediction is compared across different dispersion rates, the results indicate that a higher
dispersion rate leads to less bias from using the misspecified models (ZIP and ZINB).
Compare the Mis-specified Models across All the True Models
The mean AIC statistics of the four modified count data models under four different data
generating processes are shown in Table 2-49. The columns indicate the true data generating
model, and the rows indicate the model that is employed. The average of the AIC statistics of
each model given each data generating process across all the different proportions of zero and
dispersion rates are shown.
Across all four data generating scenarios, it is always the true model that has the lowest
average AIC statistics. Among the true models, the ZIP model has the lowest average AIC
statistic.
When examining the misspecified models, when the true model is ZIP, the best
alternative model is ZINB. In this case, the AIC of the HP model is larger, especially when the
proportion of structural zeros is small, indicating that the HP model does not fit well when the
ZIP is the true model. When the true model is ZINB, the best alternative model is HNB, and in
this case the AIC statistics of both the ZIP and HP models are large. Similarly, when the true
model is HP, the best alternative model is HNB, and the misspecified ZIP model has the worst
model fit. When the true model is HNB, the best alternative model is ZINB, and both HP and
ZIP have significantly larger AIC statistics.
Overall, the Negative Binomial formulation has good fit based on the AIC statistics when
the data has significant variance (with the Negative Binomial formulation). Moreover, even when
the true model has the Poisson formulation, the Negative Binomial formulation still has good fit
based on the AIC statistics.
The average relative bias of the four modified count data models under four different data
generating processes are shown in Table 2-50. Each column indicates the true data generating
model, and each row shows the model that is employed. The average relative bias of each model
given each data generating process across all the different proportions of zeros and dispersion
rates are shown.
Across all four data generating scenarios, it is not always the true model that has the
least bias. Overall, the ZIP model has the lowest prediction bias of Y. The results also indicate
that models with Negative Binomial formulation are observed to be highly biased. This is
because the models with Negative Binomial formulation have poor prediction when there are
many zero observations.
When considering the possible impact of employing misspecified models, the ZINB
model has the least bias when the true model is ZIP, with the average bias smaller than the true
ZIP model. Both the ZIP and ZINB models perform well, and better than the true model, when
the true model is HP. When the true model is ZINB or HNB, the ZIP has the least bias.
Overall, models with the Negative Binomial formulation have large bias in predicting the
mean of Y. Among all the different data generating processes, the ZIP model performs
reasonably well in all cases.
Finally, in terms of predicting zero observations, the results are highly dependent on the
assumption of the source of zeros. If only one type of zero exists, Hurdle models will always
predict the zero observations accurately, while the Zero-inflated models tend to overestimate
zero observations. If two types of zero exist, both Hurdle models and Zero-inflated models
predict zeros accurately, however, the Hurdle models fail to differentiate the two types of zero.
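The difference between the two treatments of zeros can be written out for the Poisson-based models. The sketch below uses the standard forms P(Y = 0) = pi + (1 - pi) exp(-lambda) for ZIP and P(Y = 0) = p_zero for a hurdle model; the value of lambda is hypothetical and the function names are illustrative, not taken from this study.

```python
import math


def zip_expected_zeros(n, pi, lam):
    """Expected (structural, sampling) zero counts under a ZIP model."""
    structural = n * pi
    sampling = n * (1.0 - pi) * math.exp(-lam)
    return structural, sampling


def hurdle_expected_zeros(n, p_zero):
    """Expected zero count under a hurdle model: a single pool of zeros,
    with no split into structural and sampling zeros."""
    return n * p_zero


if __name__ == "__main__":
    n = 250    # sample size used in the study
    lam = 1.5  # hypothetical Poisson mean
    for pi in (0.1, 0.5, 0.9):
        structural, sampling = zip_expected_zeros(n, pi, lam)
        total = structural + sampling
        # A hurdle model fitted to the same data only needs p_zero = total / n
        # to reproduce the total, but it cannot recover the split.
        print(pi, round(structural, 1), round(sampling, 1),
              round(hurdle_expected_zeros(n, total / n), 1))
```

The hurdle zero probability is a free parameter of the binary part, which is why Hurdle models match the observed number of zeros by construction while providing no estimate of structural zeros.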
Conclusion
This research was conducted to compare the performance of the most commonly used
count data models given different data-generating processes, proportion of zeros, and data
dispersion. Although Poisson regression techniques are the most popular when dealing with the
count data, the assumption of “mean equals variance” of Poisson regression is violated with the
existence of many zeros. As a result, modified Poisson regression models have been designed
and employed in count data analysis with zero-inflation issues. The most commonly used
modified models include the Zero-inflated Poisson regression (ZIP) and the Hurdle Poisson
regression (HP) models. Concerning the possible issue of over-dispersion, both the Zero-inflated
and Hurdle models can also be estimated using a Negative Binomial distribution.
The comparative performance of these commonly used count data models has rarely been
examined using simulation studies in previous research. Additionally, there is even less research
that compares these models while allowing different variations of the proportion of zeros and
dispersion of the data.
This research addresses the knowledge gap by comparing the performance of the six
commonly used count data models using simulation and quantifying the models’ capabilities of
predicting zero observations under different simulation conditions. These simulation conditions
included four different data generating distributions, five different proportions of zeros, and four
different dispersion rates. Special attention was given to differentiating the amounts of different
types of zeros to allow the evaluation of the proportion of correctly identified structural zeros for
the Zero-inflated Models, and test of the consequences, if any, of misspecifying the latent classes
for zeros given different levels of structural zeros.
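The simulation conditions described above can be enumerated as a grid. The sketch below is a reading aid rather than the study's code. It assumes that the four dispersion rates apply only to the Negative Binomial data generating processes, which is consistent with the Poisson-based tables being indexed by the proportion of zeros alone; the names are illustrative.

```python
from itertools import product

PROPORTIONS = (0.1, 0.3, 0.5, 0.7, 0.9)   # proportions of zeros
DISPERSIONS = (0.5, 1, 2, 4)              # dispersion rates (NB processes)
FITTED_MODELS = ("Poisson", "NB", "ZIP", "ZINB", "HP", "HNB")
N_OBS = 250                               # sample size per simulated data set
N_REPLICATIONS = 1000                     # simulation size


def simulation_conditions():
    """List (true_model, proportion_of_zeros, dispersion) conditions.

    ASSUMPTION: Poisson-based processes have no dispersion rate (None).
    """
    conditions = [(dgp, p, None) for dgp in ("ZIP", "HP") for p in PROPORTIONS]
    conditions += list(product(("ZINB", "HNB"), PROPORTIONS, DISPERSIONS))
    return conditions


if __name__ == "__main__":
    grid = simulation_conditions()
    print("conditions:", len(grid))
    print("fitted models per simulated data set:", len(FITTED_MODELS))
```

Each condition is then replicated, and all six candidate models are fitted to every simulated data set so that the true model and the misspecified models are compared on identical data.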
To answer these questions, a Monte-Carlo experiment was employed, and the statistics
evaluated include the Loglikelihood, AIC, relative-bias of the prediction mean, and predictions
of zero observations. This experiment was designed to provide practical recommendations to
researchers when choosing among the count data models.
The conclusion is discussed in two sections. First, the models’ performance under the
four different pseudo-populations is discussed, together with the influence of the proportion
of zeros and dispersion in each scenario. Second, we focus on the choice of
different models based on this research.
Model Performance Given Four Pseudo-Populations
True model - Zero-inflated Poisson Model
In terms of model fit, when the true model is the Zero-inflated Poisson Regression Model
(ZIP), the true model has the best model fit. Among the remaining misspecified models, the
Zero-inflated Negative Binomial (ZINB) model is very close to the true model, followed by the
Hurdle Poisson (HP) and Hurdle Negative Binomial (HNB), which remain close to the true
model, especially when the percentage of zeros is higher. When the ZIP is the true model, the
Poisson and Negative Binomial regression models have very poor model fit, indicating that they
do not handle cases when the problem of zero-inflation exists.
The true ZIP model also performs well in terms of the prediction of Y when the percentage
of zeros is relatively small. However, as the percentage of zeros increases, the misspecified
ZINB model has smaller bias than the true model. The HP and HNB models have large bias,
especially when the proportion of zeros is large.
When comparing the models’ capacity to predict zeros, the ZIP, ZINB, HP, and HNB all
predict the observed zero observations accurately. However, since the Zero-inflated models
allow for different generating processes for structural and sampling zeros, both the true ZIP
model and ZINB model accurately capture the structural zeros when the percentage of zeros is
relatively small. As the percentage of zeros increases, the ZINB model tends to underestimate the
structural zeros. Due to the assumptions of the models, Hurdle models fail to differentiate the
two types of zeros.
What is more, focusing on the true ZIP model itself, as the percentage of zeros increases,
model fit improves, yet the prediction of the average of Y gets worse. As the number of zero
observations increases, it becomes more difficult for the model to converge.
True model - Hurdle Poisson Model
When the true model is the Hurdle Poisson Regression Model (HP), the true model has the
best model fit, and the HNB model is the best among the five misspecified models, which is very
close to the true model. Following HNB, the Zero-inflated Poisson (ZIP) and Zero-inflated
Negative Binomial (ZINB) models are close to the true model, especially when the percentage of
zeros is high. Again, in this case, neither the Poisson nor the Negative Binomial regression model
has good model fit, indicating that they do not handle data when zero-inflation exists.
Regarding the relative error rate of the prediction of Y, the true model HP is not always the
best model. When the percentage of zeros is relatively small, the Hurdle Negative Binomial
(HNB) Model has the least bias. When the percentage of zeros is extremely large (greater than
90%), the ZIP model has the least bias, and the true HP model has large bias. Because of the
large bias when the percentage of zeros is large, on average, it is the ZIP model, not the true HP
model, that has the least average bias.
When comparing the models’ capacity of predicting zeros, we find that both the true HP
model and HNB predict the observed zero observations accurately. The Zero-inflated models are
likely to overestimate zeros in this case.
Overall, as the percentage of zeros increases, model fit for the true HP model improves, but
the prediction of the average of Y gets worse. In addition to poor prediction of Y when there
are many zero observations, it becomes more difficult for the model to converge.
True model - Zero-inflated Negative Binomial Model
When the true model is the Zero-inflated Negative Binomial (ZINB), the true model ZINB
has the best model fit, and the HNB model is the best among the five misspecified models. The
ZIP and HP models have very poor fit across all the different proportions of zeros and dispersion
rates. Regarding the impact of the dispersion, both the true ZINB model and HNB get worse as
the dispersion rate increases. Again in this case, neither the Poisson nor the Negative Binomial
regression models have good fit.
Regarding the relative error rate of the prediction of Y, the true ZINB model is not always
the best. When the percentage of zeros is relatively small, the true ZINB model tends to have the
least bias, and the ZIP model performs nearly as well. However, when there are relatively more
zeros, the misspecified ZIP model has the least bias. Regarding the impact of dispersion rate, the
true model has the least bias when the dispersion rate is 0.5 and 4, and the misspecified ZIP
model has the least average bias when the dispersion rate is 1 and 2. Both the HP and HNB models
have very large bias, especially when the percentage of zeros is greater than 50%.
When comparing the models’ capacities for predicting zeros, the true ZINB model, together
with the HNB, ZIP and HP model all predict the observed zero observations accurately.
However, the misspecified ZIP model tends to overestimate structural zeros, while the true ZINB
model tends to underestimate structural zeros when the percentage of zeros is large.
As the percentage of zeros increases, the model fit of the ZINB improves. As the dispersion
rate increases, the model fit gets worse. In terms of the relative bias, the accuracy of the
model’s prediction decreases with an increase in the proportion of zeros.
True model - Hurdle Negative Binomial Model
Finally when the true model is Hurdle Negative Binomial (HNB), the true HNB model has
the best model fit when the percentage of zeros is relatively small. When the percentage of zeros
increases to more than 70%, the misspecified ZINB has better model fit. The ZIP and HP models
have very poor fit across all the different levels of zero proportion and dispersion rate.
Considering the relative error rate of predicting Y, when the proportion of zeros is relatively
small, the misspecified HP has the least bias, and the true model HNB is the second best. When
the proportion of zeros increases, the ZIP model has the least bias, and both the true HNB and
HP models have very large prediction errors. Across dispersion rates, the ZIP is always the least
biased when P(zero)>50%, and the true HNB model has very high bias when P(zero)>50%.
Regarding the models’ capacity for predicting zeros, the true HNB model and the HP model
predict the observed zero observations accurately. Both the ZIP and ZINB models tend to
overestimate zero observations.
As the percentage of zeros increases, model fit improves, while the prediction of Y gets
worse. As the dispersion rate increases, model fit improves, and no significant patterns are
captured as the dispersion rate changes.
Comparison between Different Models
Poisson formulation versus Negative-Binomial formulation
To better understand the models, we will compare models with the Poisson formulation to
models with the Negative Binomial formulation under four different model distributions. The
comparison will be conducted based on both the model fit and relative bias of prediction.
In terms of the model fit, when the true model is ZIP, the best alternative model is ZINB.
However, when the true model is ZINB, the best alternative model is HNB instead of the ZIP (in
this case the AIC statistics of both the ZIP and HP models are extremely large). Similarly, when
the true model is HP, the best alternative model is HNB, and the misspecified ZIP model does not
fit well. When the true model is HNB, the best alternative model is ZINB, and both HP and ZIP
have considerably poorer model fit. Thus, when the true model has a Poisson formulation, its
Negative Binomial counterpart will be a good alternative, yet when the true model has an NB
formulation, misspecifying the model with a Poisson formulation will cause poor model
fit.
The models can also be compared based on the relative bias of E(Y|X). When the true model
is ZIP, the ZINB has the second least bias following the true model and when the zero proportion
increases, the misspecified ZINB model has smaller relative bias than the true ZIP model. When
the true model is HP, the HNB has the least bias except for the true model, and the bias is
relatively small especially when the zero proportion is small. However, when the true model is
ZINB, the alternative best model is the ZIP; and when the true model is HNB, the alternative
best model is the HP. Overall, models with Negative Binomial formulation are observed to be
highly biased when predicting Y. This is likely related to models with Negative Binomial
formulation having poor prediction when there are many zero observations.
Thus, given that researchers have long been concerned about the choice of models with
either Poisson or NB formulations, results from this study strongly suggest that a test of
“dispersion” is needed before determining the model. If the dataset does have “dispersion”,
misspecifying the models with Poisson formulation will cause poor fit. On the other hand,
when the true model has the Poisson formulation, using a model with NB formulation will not
cause severe bias except when there are many zero observations.
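The text does not name a specific dispersion test. One common option, sketched here as an assumption rather than as the procedure used in this study, is a likelihood-ratio test between the Poisson-type and NB-type versions of the same model (for example, ZIP versus ZINB). Because the Poisson form lies on the boundary of the NB parameter space, the usual chi-square p-value with one degree of freedom is halved. The function name is illustrative.

```python
import math


def lr_overdispersion_test(loglik_poisson_type, loglik_nb_type):
    """Likelihood-ratio test of a Poisson-type model against its NB-type
    counterpart. Returns (statistic, p_value).

    The p-value uses the 50:50 mixture of chi-square(0) and chi-square(1)
    that applies when the tested parameter sits on the boundary.
    """
    stat = max(0.0, 2.0 * (loglik_nb_type - loglik_poisson_type))
    if stat == 0.0:
        return stat, 1.0
    # Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    p_chi2_1 = math.erfc(math.sqrt(stat / 2.0))
    return stat, 0.5 * p_chi2_1


if __name__ == "__main__":
    # Illustrative arithmetic only: mean log-likelihoods from the 10% row of
    # Table 2-2 (true model is ZIP). In practice the test is applied to the
    # two log-likelihoods obtained from a single data set.
    stat, p_value = lr_overdispersion_test(-506.6, -506.5)
    print(f"LR statistic = {stat:.2f}, p-value = {p_value:.3f}")
```

A small statistic, as in this illustration, gives no evidence that the NB formulation is needed; a large statistic would favor the NB-type model.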
Another aspect of the data that can impact model choice is the proportion of zeros and
structural zeros in the data. Results showed that the percentage of zeros influences not only
the performance of each model itself, but also the comparison of the Poisson
and Negative Binomial formulations.
When the true model is ZIP, as the proportion of structural zeros increases, model fit of the
true model improves. As the proportion of structural zeros increases, the misspecified models
HP, HNB and ZINB all improve in terms of model fit. However, the difference between the AIC
statistics of the true ZIP model and the best alternative model, ZINB, increases. In other words,
when the true model is ZIP, the advantage of using the true model rather than the next best
alternative, ZINB, becomes more significant. Similarly, when the true model is HP, as the
proportion of zeros increases, the true model fit improves. The difference between the AIC
statistics of the true model and the next best alternative (HNB) increases, indicating that the
benefit of using the true model is more significant.
When the true model is ZINB, regardless of the dispersion rate, the model fit of the true
model improves as the proportion of structural zeros increases. In this case, neither the ZIP nor
the HP model has good model fit, indicating that the model fit is worse when employing the
misspecified Poisson formulation. However, the difference in AIC statistics between the ZIP and
ZINB models decreases as the zero proportion increases. What is more, regarding the relative
bias of prediction, the error rate of the misspecified ZIP model decreases with the increase in the
proportion of zeros. When the zero proportion gets large, the bias of the ZIP is even smaller than
the ZINB model.
When the true model is the HNB, the model fit of the true model improves as the proportion
of zeros increases regardless of the dispersion rate. Again, in this case, both the ZIP and HP
models have poor model fit in terms of AIC statistics. Regarding the relative bias of
prediction, the misspecified HP model has the least bias when the proportion of zeros is
relatively small and the misspecified ZIP model has the least bias when the proportion of zeros
increases.
The proportion of zeros impacts model fit, as does the dispersion rate. When the true model is
ZINB, regardless of the proportion of zeros, the ZINB model has the best model fit when the
dispersion rate equals 0.5. As the dispersion rate increases, the model fit worsens. In terms of
model comparison, the next best alternative (HNB) also has the best model fit when the
dispersion rate is 0.5. As the dispersion rate increases, the difference between the AIC statistics
for the ZINB and HNB models increases slightly, indicating that the advantage of using the true
model gets more significant. Again, across all the four levels of dispersion rates tested, the AIC
statistics of both the HP and ZIP models are quite large, which indicates a poor fit in this case.
Finally, we also find that if the true distribution is either a Zero-inflated model or Hurdle
model, neither the Poisson regression model nor the NB regression model has good model fit,
which indicates that the simple one-stage model cannot handle the data well when there are
excessive zeros and when it is assumed that the zero observations are generated differently.
Zero-inflated models versus Hurdle models
Prior research that compared Zero-inflated models with Hurdle models has been mostly
based on the assumption of how the zero observations were generated and very little research
compared model performance, especially using a simulation method. This study is the first to
examine both the ZIP and HP models with their Negative Binomial formulations under
simulation by controlling the proportion of zeros and dispersion rate.
First, focusing on the true model itself, the ZIP model has better model fit than HP, and the
ZINB model has a better model fit than the HNB when applying the true model to the data set
generated from its own population. In terms of relative bias, across different proportions of
zeros and dispersion rates, the ZIP model has a larger relative bias than the HP
model when applying the true model to the data generated from its own population; similarly, the
ZINB has a larger bias than HNB.
Next, when considering the possible impact of utilizing a misspecified model, we find when
the true model is the ZIP, the misspecified model HP has close performance and relative bias
compared to the true model especially when the proportion of structural zeros is large.
Conversely, when the true model is the HP, the misspecified model ZIP has a smaller bias than
the true model, and similar model fit performance. This is mainly because the HP model has a very
large bias when the proportion of zeros is very large.
When the true model is the ZINB, the misspecified HNB has close model fit compared to the
true model, yet the relative bias of the HNB model is very large. What is more, the relative bias
of the misspecified HNB model increases as the proportion of zeros and the over-dispersion rate
increase. Specifically, when the zero proportion increases to more than 50%, the relative bias of
mean of the HNB model is more than 100% regardless of the over-dispersion rate. When the true
model is the HNB, the misspecified model ZINB has close performance in terms of model fit,
and the relative bias of the misspecified ZINB model is better than that of the true HNB model.
Thus, based on this study, the ZIP model appears to be the better performing model in terms
of the model fit. Similarly, the ZINB model appears to be the better performing model in terms
of the model fit, and also has a better prediction capability than the HNB. If the researcher has no
prior assumption of the zero-generating processes, the Zero-inflated model is preferred over the
Hurdle models when choosing between these two competing models. Specifically, attention
should be given to the case when the proportion of zeros is more than 50%. With many zero
observations in the dataset, the Hurdle models tend to have a large relative bias.
Capability of Predicting Zero Observations
In this analysis, we have paid special attention to the models’ capabilities of predicting zeros
and structural zeros (for Zero-inflated models). When the true model is the ZIP, we find that the
HP, HNB, ZIP and ZINB predict zeros very accurately. The Hurdle models always capture the
zero prediction accurately because of the model design. However, in terms of the prediction of
structural zeros, when comparing the ZIP with the ZINB models, both models predict structural
zeros accurately when the zero proportion is smaller than 0.5. When the zero proportion
increases to larger than 50%, the ZINB model likely underestimates the proportion of structural
zeros.
When the true model is ZINB, again, the Hurdle models accurately capture the proportion of
zeros. However, in terms of the prediction of structural zeros, the ZIP model significantly
overestimates the number of structural zeros especially when the over-dispersion rate is small.
As for the ZINB model, it has good prediction of the structural zeros when the proportion of
structural zeros is smaller than 50%. When the proportion of structural zeros is larger than 50%,
the ZINB model likely underestimates the number of structural zeros.
When the true model is HP, no structural zeros exist and we only compare the capability of
predicting zeros. In this case, we find that both the ZIP and ZINB models overestimate zero
observations. Similarly, when the true model is HNB, both the ZIP and ZINB overestimate zeros.
However, when the over-dispersion rate gets as high as 4, the overestimation of zeros by the
Zero-inflated models decreases.
Thus, by comparing the four models’ capability of capturing the zero observations, we find
that Hurdle models always predict the number of zeros accurately. However, Hurdle models
cannot differentiate between different types of zeros. Zero-inflated models allow two different
types of zero, yet they may overestimate structural zeros when the proportion of zeros is
relatively small. What is more, when the true model is a Hurdle model, but Zero-inflated models
are used to model the data, zeros are likely to be overestimated. As a result, we strongly suggest
researchers first decide what assumptions to make about the process of generating zeros (whether
or not to expect two types of zeros). If in the research, the focus is mainly on the prediction of
the proportion of zeros, the Hurdle models will provide accurate estimation, yet, if the focus
includes the possible existence of different types of zeros, the Zero-inflated models are preferred.
Future Work and Limitations
As mentioned in the previous sections, sample size is an important factor influencing
simulation experiments, thus affecting the final results. In this experiment, the sample size is set
at 250. In the future, it would be interesting to explore the possible influence of sample size on
the model performance by repeating the simulations with different sample sizes. This can be
particularly important given survey research sometimes yields smaller sample sizes.
In this analysis, the number of independent variables was set to be two, with one as binary
and another one as continuous following a normal distribution. However, these two covariates
are set to be independent in this analysis, and no correlation was considered, which might not
always be true with real data. Allowing the two covariates to be dependent, and exploring how
different levels of covariate correlation influence the model performance, is another area
of interest.
Additionally, in this analysis, only the models’ fit and their capabilities of predicting zeros are
recorded and analyzed. Parameter recovery and confidence intervals were not analyzed. Given
researchers are often interested in the parameters and significance, understanding how the model
will capture the true parameters in the data analysis is another area for future study.
Table 2-1. Convergence rate, true model is ZIP
10% 30% 50% 70% 90%
Convergence rate 100% 99.9% 99.6% 95.4% 86.1%
simulation size=1000
Table 2-2. Mean Log-likelihood, true model is ZIP
P(structural zero) Poisson NB HP ZIP HNB ZINB
10% -1767.7 -683.1 -514.0 -506.6 -513.9 -506.5
30% -3950.2 -1576.5 -463.4 -454.5 -463.4 -454.4
50% -4791.4 -2069.8 -374.9 -368.4 -374.8 -368.7
70% -4814.8 -2559.9 -258.8 -254.9 -259.5 -256.8
90% -2484.5 -1569.7 -106.9 -105.4 -110.1 -108.9
Average -3561.7 -1691.8 -343.6 -337.9 -344.3 -339.1
Preferred model under each case is indicated with bold.
Table 2-3. Mean AIC, true model is ZIP
P(structural zero) Poisson NB HP ZIP HNB ZINB
10% 3541 1374 1040 1025 1042 1027
30% 7906 3161 939 921 941 923
50% 9589 4148 762 749 764 751
70% 9636 5128 530 522 533 528
90% 4975 3148 226 223 234 232
Average 7129.4 3391.8 699.4 688 702.8 692.2
Preferred model under each case is indicated with bold.
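The link between the mean log-likelihood and mean AIC values for the ZIP case can be checked with the standard definition AIC = 2k - 2 log L. The sketch below does this for the 10% row. The parameter counts `N_PARAMS` are an assumption inferred from the two-covariate design (an intercept and two slopes per linear predictor, plus one dispersion parameter for the negative binomial variants); they are not stated in the tables themselves.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_likelihood


# Assumed number of estimated parameters per model (not given in the tables).
N_PARAMS = {"Poisson": 3, "NB": 4, "HP": 6, "ZIP": 6, "HNB": 7, "ZINB": 7}

# Mean log-likelihoods reported for P(structural zero) = 10%, true model ZIP.
MEAN_LOGLIK_10 = {
    "Poisson": -1767.7, "NB": -683.1, "HP": -514.0,
    "ZIP": -506.6, "HNB": -513.9, "ZINB": -506.5,
}

# Mean AIC values reported for the same row.
REPORTED_AIC_10 = {
    "Poisson": 3541, "NB": 1374, "HP": 1040,
    "ZIP": 1025, "HNB": 1042, "ZINB": 1027,
}

implied_aic_10 = {m: aic(ll, N_PARAMS[m]) for m, ll in MEAN_LOGLIK_10.items()}

for model, value in implied_aic_10.items():
    print(model, round(value, 1), REPORTED_AIC_10[model])
```

Because AIC is linear in the log-likelihood, the mean AIC over replications equals the AIC of the mean log-likelihood. Under the assumed parameter counts, each implied value rounds to the reported entry for this row.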
Table 2-4. Relative Bias for E(Y|X), true model is ZIP
P(structural zero) Poisson NB ZIP ZINB HP HNB
10% 0.978 0.978 -0.013 -0.014 -0.125 -0.126
30% 0.976 0.974 -0.071 -0.071 -0.386 -0.386
50% 0.973 0.967 -0.170 -0.169 -0.719 -4.1E+63
70% 0.974 0.963 -0.490 -0.485 -1.543 -1E+114
90% 1.252 1.263 -1.409 -1.015 -6.122 -1.3E+147
Average 1.031 1.029 -0.431 -0.351 -1.779 -2.6E+146
Preferred model under each case is indicated with bold.
Table 2-5. Observed and predicted zero observations, true model is ZIP
P(structural zero) Obs Poisson NB HP ZIP ZIP(structural zero) HNB ZINB ZINB(structural zero)
0.1 83 68 76 83 84 27 83 84 27
0.3 121 74 99 121 121 75 121 121 75
0.5 158 85 121 158 158 125 158 158 125
0.7 195 94 129 195 195 175 195 195 171
0.9 231 122 152 231 231 224 231 231 197
Table 2-6. Convergence rate, true model is HP
10% 30% 50% 70% 90%
Convergence rate 90.2% 84.2% 75.0% 71.1% 75.4%
simulation size=1000
Table 2-7. Mean Log-likelihood, true model is HP
P(zero) Poisson NB HP ZIP HNB ZINB
10% -1901.1 -756.0 -515.1 -616.5 -515.0 -616.0
30% -4043.4 -874.5 -490.2 -550.5 -490.2 -550.1
50% -5074.1 -1034.5 -414.0 -448.4 -414.0 -448.2
70% -4681.3 -1396.8 -294.0 -310.0 -294.0 -310.1
90% -2293.0 -960.2 -123.6 -127.1 -135.0 -128.2
Average -3598.6 -1004.4 -367.4 -410.5 -369.6 -410.5
Preferred model under each case is indicated with bold.
Table 2-8. Mean AIC, true model is HP
P(structural zero) Poisson NB HP ZIP HNB ZINB
10% 3808 1520 1042 1245 1044 1246
30% 8093 1757 993 1113 994 1114
50% 10154 2077 840 909 842 911
70% 9369 2802 600 632 602 634
90% 4592 1928 259 266 284 271
Average 7203.2 2016.8 746.8 833.0 753.2 835.2
Preferred model under each case is indicated with bold.
Table 2-9. Relative Bias for E(Y|X), true model is HP
P(zero) Poisson NB ZIP ZINB HP HNB
10% 0.978 0.973 0.198 0.207 0.070 0.070
30% 0.975 0.967 0.334 0.337 0.065 0.064
50% 0.969 0.961 0.407 0.409 -0.050 -8.8E+14
70% 0.961 0.952 0.422 0.424 -0.258 -0.259
90% 1.167 1.150 0.179 0.221 -4.546 -2E+99
Average 1.010 1.001 0.308 0.320 -0.944 -1.76E+14
Preferred model under each case is indicated with bold.
Table 2-10. Observed and predicted zero observations, true model is HP
P(zero) Obs Poisson NB HP ZIP HNB ZINB
10% 25 64 56 25 74 25 74
30% 75 69 82 75 105 75 105
50% 124 71 117 124 141 124 141
70% 175 78 145 175 181 175 181
90% 226 116 178 226 226 226 226
Table 2-11. Convergence rate, true model is ZINB
10% 30% 50% 70% 90%
Dispersion=0.5 96.7% 94.7% 92.2% 88.3% 82.6%
Dispersion=1 98.5% 97.9% 93.7% 88.1% 82.9%
Dispersion=2 99.8% 99.4% 97.0% 92.7% 85.6%
Dispersion=4 100% 99.6% 98.6% 92.5% 86.1%
simulation size=1000
Table 2-12. Mean Log-likelihood, true model is ZINB (dispersion=0.5)
P(structural zero) Poisson NB HP ZIP HNB ZINB
10% -9282.9 -1974.7 -7436.0 -7436.9 -616.5 -614.5
30% -9240.1 -2836.1 -5364.9 -5364.2 -521.2 -518.2
50% -8331.4 -3408.3 -3548.4 -3547.3 -406.1 -403.0
70% -6483.6 -3157.3 -1861.8 -1861.5 -266.5 -264.5
90% -2559.1 -52550.0 -327.6 -327.7 -105.4 -102.8
Average -7179.4 -12785.3 -3707.7 -3707.5 -383.2 -380.6
Preferred model under each case is indicated with bold.
Table 2-13. Mean AIC, true model is ZINB (dispersion=0.5)
P(structural zero) Poisson NB HP ZIP HNB ZINB
10% 18571.9 3957.4 14883.9 14885.8 1247.1 1243.1
30% 18486.2 5680.1 10741.9 10740.5 1056.5 1050.3
50% 16668.8 6824.5 7108.9 7106.5 826.2 820.1
70% 12973.2 6322.6 3735.7 3734.9 547.0 543.1
90% 5124.2 6110.1 667.2 667.5 224.7 219.7
Average 14364.8 5578.9 7427.5 7427.0 780.3 775.3
Preferred model under each case is indicated with bold.
Table 2-14. Mean Log-likelihood, true model is ZINB (dispersion=1)
P(structural zero) Poisson NB HP ZIP HNB ZINB
10% -6442.8 -1259.7 -4960.2 -4960.1 -647.1 -644.0
30% -7136.6 -2414.8 -3610.4 -3607.6 -555.4 -550.6
50% -6859.7 -3148.4 -2328.2 -2325.3 -434.1 -430.2
70% -5719.4 -2931.7 -1273.3 -1271.4 -290.8 -288.5
90% -2457.0 -1407.9 -266.2 -265.7 -115.5 -113.7
Average -5723.1 -2232.5 -2487.6 -2486.0 -408.6 -405.4
Preferred model under each case is indicated with bold.
Table 2-15. Mean AIC, true model is ZINB (dispersion=1)
P(structural zero) Poisson NB HP ZIP HNB ZINB
10% 12891.6 2527.5 9932.4 9932.1 1308.2 1301.9
30% 14279.3 4837.7 7232.9 7227.2 1124.9 1115.2
50% 13725.4 6304.8 4668.4 4662.6 882.1 874.4
70% 11444.8 5871.6 2558.5 2554.9 595.6 590.9
90% 4920.0 2823.9 544.4 543.4 245.1 241.4
Average 11452.3 4473.1 4987.3 4984.1 831.2 824.7
Preferred model under each case is indicated with bold.
Table 2-16. Mean Log-likelihood, true model is ZINB (dispersion=2)
P(structural zero) Poisson NB HP ZIP HNB ZINB
10% -4271.3 -920.7 -2938.5 -2936.1 -641.6 -636.9
30% -5484.7 -1877.1 -2134.8 -2129.9 -556.2 -549.7
50% -6095.8 -2726.5 -1480.1 -1476.1 -439.6 -434.6
70% -5316.4 -2760.0 -818.6 -816.0 -295.8 -292.6
90% -2459.7 -1472.2 -203.6 -202.6 -117.8 -116.2
Average -4725.5 -1951.3 -1515.1 -1512.1 -410.2 -406.0
Preferred model under each case is indicated with bold.
Table 2-17. Mean AIC, true model is ZINB (dispersion=2)
P(structural zero) Poisson NB HP ZIP HNB ZINB
10% 8548.5 1849.4 5889.0 5884.2 1297.3 1288.0
30% 10975.6 3762.3 4281.6 4271.8 1126.4 1113.3
50% 12197.7 5460.9 2972.4 2964.2 893.4 883.3
70% 10638.7 5528.0 1649.3 1643.9 605.5 599.3
90% 4925.5 2952.4 419.2 417.3 249.5 246.4
Average 9457.2 3910.6 3042.3 3036.3 834.4 826.1
Preferred model under each case is indicated with bold.
Table 2-18. Mean Log-likelihood, true model is ZINB (dispersion=4)
P(structural zero) Poisson NB HP ZIP HNB ZINB
10% -3072.9 -776.7 -1779.1 -1774.7 -623.5 -617.5
30% -4705.2 -1644.5 -1392.9 -1386.2 -545.7 -538.1
50% -5403.6 -2495.2 -923.5 -917.9 -431.9 -425.9
70% -4997.9 -2810.9 -547.4 -544.1 -293.1 -289.9
90% -2490.1 -1620.3 -159.2 -157.9 -117.6 -116.1
Average -4133.9 -1869.5 -960.4 -956.2 -402.4 -397.5
Preferred model under each case is indicated with bold.
Table 2-19. Mean AIC, true model is ZINB (dispersion=4)
P(structural zero) Poisson NB HP ZIP HNB ZINB
10% 6151.9 1561.5 3570.2 3561.4 1260.9 1249.1
30% 9416.4 3297.0 2797.8 2784.4 1105.5 1090.4
50% 10813.2 4998.3 1859.0 1847.9 877.8 865.9
70% 10001.7 5629.9 1106.9 1100.1 600.1 593.9
90% 4986.1 3248.6 330.5 327.9 249.3 246.3
Average 8273.9 3747.1 1932.9 1924.3 818.7 809.1
Preferred model under each case is indicated with bold.
Table 2-20. Mean AIC of ZINB model, true model is ZINB
P(structural zero) D=0.5 D=1 D=2 D=4 Average
10% 1243.1 1301.9 1288.0 1249.1 1270.5
30% 1050.4 1115.2 1113.3 1090.4 1092.3
50% 820.0 874.4 883.3 865.9 860.9
70% 543.1 590.9 599.3 593.9 581.8
90% 219.7 241.4 246.4 246.3 238.4
Average 775.3 824.8 826.1 809.1 808.8
Table 2-21. Relative Bias for E(Y|X), true model is ZINB (dispersion=0.5)
P(structural zero) Poisson NB ZIP ZINB HP HNB
10% 0.971 0.973 -0.118 -0.099 -0.253 -0.314
30% 0.966 0.968 -0.209 -0.258 -0.514 -0.610
50% 0.960 0.961 -0.213 -0.334 -0.870 -9.8E+42
70% 0.958 0.964 -0.803 -0.590 -14.90 -1.4E+85
90% 1.261 1.267 -0.024 0.2504 -1680.57 -2E+106
Average 1.023 1.0273 -0.274 -0.206 -339.422 -4E+102
Preferred model under each case is indicated with bold.
Table 2-22. Relative Bias for E(Y|X), true model is ZINB (dispersion=1)
P(structural zero) Poisson NB ZIP ZINB HP HNB
10% 0.973 0.975 -0.121 -0.081 -0.234 -0.256
30% 0.970 0.970 -0.190 -0.208 -0.457 -0.574
50% 0.965 0.962 -0.214 -0.295 -0.756 -1.7E+65
70% 0.947 0.942 0.353 0.413 -0.494 -1.3E+40
90% 1.248 1.243 0.087 0.290 -1305.27 -2E+113
Average 1.021 1.0188 -0.017 0.023 -261.443 -4E+112
Preferred model under each case is indicated with bold.
Table 2-23. Relative Bias for E(Y|X), true model is ZINB (dispersion=2)
P(structural zero) Poisson NB ZIP ZINB HP HNB
10% 0.976 0.977 -0.07 -0.054 -0.168 -0.202
30% 0.973 0.971 -0.143 -0.142 -0.425 -0.473
50% 0.970 0.966 -0.240 -0.228 -0.817 -2E+65
70% 0.967 0.961 -0.446 -0.509 -1.485 -4.8E+96
90% 1.318 1.341 -1.384 -2.092 -56228.9 -2E+14
Average 1.041 1.0436 -0.457 -0.605 -11246.3 -4E+14
Table 2-24. Relative Bias for E(Y|X), true model is ZINB (dispersion=4)
P(structural zero) Poisson NB ZIP ZINB HP HNB
10% 0.977 0.977 -0.063 -0.047 -0.168 -0.175
30% 0.973 0.971 -0.125 -0.121 -0.431 -0.437
50% 0.971 0.966 -0.267 -0.271 -0.978 -0.928
70% 0.971 0.962 -0.463 -0.463 -1.561 -4.1E+55
90% 1.279 1.294 -1.500 -1.012 -9.5E+1 -2E+132
Average 1.034 1.034 -0.484 -0.384 -1.9E+1 -4E+131
Preferred model under each case is indicated with bold.
Table 2-25. Relative Bias for E(Y|X) of ZINB model, true model is ZINB
P(structural zero) D=0.5 D=1 D=2 D=4 Average
10% -0.099 -0.081 -0.054 -0.047 -0.070
30% -0.258 -0.208 -0.142 -0.121 -0.182
50% -0.334 -0.295 -0.228 -0.271 -0.282
70% -0.590 0.413 -0.509 -0.463 -0.287
90% 0.250 0.290 -2.092 -1.018 -0.642
Average -0.206 0.023 -0.605 -0.384 -0.293
Table 2-26. Observed and predicted zero observations, true model is ZINB (dispersion=0.5)
P(structural zero) Obs Poisson NB HP ZIP ZIP(structural zero) HNB ZINB ZINB(structural zero)
10% 120 63 107 120 122 95 120 120 33
30% 148 66 118 148 149 128 148 148 76
50% 177 73 127 177 178 161 177 177 117
70% 207 85 141 207 207 195 207 206 132
90% 235 128 169 235 235 227 235 235 129
Table 2-27. Observed and predicted zero observations, true model is ZINB (dispersion=1)
P(structural zero) Obs Poisson NB HP ZIP ZIP(structural zero) HNB ZINB ZINB(structural zero)
10% 102 64 94 102 104 68 102 102 29
30% 134 69 107 134 135 105 134 134 75
50% 168 76 117 168 168 145 168 167 123
70% 200 87 128 200 201 186 200 200 158
90% 233 126 164 233 233 225 233 233 158
Table 2-28. Observed and predicted zero observations, true model is ZINB (dispersion=2)
P(structural zero) Obs Poisson NB HP ZIP ZIP(structural zero) HNB ZINB ZINB(structural zero)
10% 93 66 86 93 95 50 93 93 28
30% 128 72 104 128 129 91 128 128 76
50% 162 80 117 162 163 135 162 162 125
70% 198 88 127 198 198 181 198 197 166
90% 232 125 159 232 232 225 232 232 174
Table 2-29. Observed and predicted zero observations, true model is ZINB (dispersion=4)
P(structural zero) Obs Poisson NB HP ZIP ZIP(structural zero) HNB ZINB ZINB(structural zero)
10% 88 68 81 88 89 39 88 88 27
30% 124 73 102 124 125 85 124 124 76
50% 160 80 116 160 160 131 160 160 125
70% 196 91 126 196 196 177 196 196 167
90% 232 122 151 232 232 224 232 232 188
Table 2-30. Convergence rate, true model is HNB
10% 30% 50% 70% 90%
Dispersion=0.5 83.3% 81.9% 80.4% 80.2% 82.0%
Dispersion=1 84.2% 80.1% 78.8% 75.1% 77.0%
Dispersion=2 83.6% 79.3% 76.0% 73.5% 78.6%
Dispersion=4 87.3% 82.3% 76.5% 72.4% 77.4%
simulation size=1000
Table 2-31. Mean Log-likelihood, true model is HNB (dispersion=0.5)
P(zero) Poisson NB HP ZIP HNB ZINB
10% -9174.8 -891.6 -7674.1 -7736.6 -714.3 -785.4
30% -8929.8 -794.2 -5500.9 -5537.6 -642.3 -680.6
50% -8227.5 -936.9 -3601.6 -3623.7 -523.9 -544.1
70% -6222.3 -1039.8 -1853.1 -1864.1 -360.3 -369.3
90% -2624.9 -967.8 -411.8 -415.3 -154.7 -150.1
Preferred model under each case is indicated with bold.
Table 2-32. Mean AIC, true model is HNB (dispersion=0.5)
P(zero) Poisson NB HP ZIP HNB ZINB
10% 18355.7 1791.3 15360.3 15485.2 1442.5 1584.8
30% 17865.6 1596.4 11013.8 11087.18 1298.6 1375.1
50% 16461.1 1881.9 7215.3 7259.3 1061.9 1102.3
70% 12450.6 2087.6 3718.2 3740.2 734.7 752.7
90% 5256.0 1943.7 835.6 842.5 323.5 314.2
Average 14077.8 1860.2 7628.6 7682.8 972.3 1025.8
Preferred model under each case is indicated with bold.
Table 2-33. Mean Log-likelihood, true model is HNB (dispersion=1)
P(zero) Poisson NB HP ZIP HNB ZINB
10% -6158.3 -790.6 -4764.7 -4843.9 -687.2 -760.4
30% -6898.6 -813.7 -3540.4 -3588.4 -623.4 -664.9
50% -7084.0 -1146.0 -2426.6 -2454.1 -508.7 -531.4
70% -5458.7 -1118.2 -1278.8 -1292.1 -351.7 -362.3
90% -2345.9 -826.1 -300.8 -304.8 -150.7 -146.4
Average -5589.1 -938.9 -2462.3 -2496.6 -464.4 -493.1
Preferred model under each case is indicated with bold.
Table 2-34. Mean AIC, true model is HNB (dispersion=1)
P(zero) Poisson NB HP ZIP HNB ZINB
10% 12322.7 1589.3 9541.4 9699.8 1388.3 1534.8
30% 13803.2 1635.3 7092.9 7188.7 1260.8 1343.8
50% 14174.0 2300.0 4865.3 4920.1 1031.5 1076.8
70% 10923.5 2244.4 2569.6 2596.2 717.5 738.6
90% 4697.9 1660.1 613.7 621.7 315.5 306.8
Average 11184.3 1885.8 4936.6 5005.3 942.7 1000.2
Preferred model under each case is indicated with bold.
Table 2-35. Mean Log-likelihood, true model is HNB (dispersion=2)
P(zero) Poisson NB HP ZIP HNB ZINB
10% -4447.8 -884.8 -2899.8 -2988.7 -660.0 -737.3
30% -5523.5 -842.9 -2180.7 -2232.4 -604.1 -648.7
50% -5987.8 -874.9 -1476.7 -1506.7 -492.9 -518.1
70% -5188.3 -1377.1 -854.7 -869.4 -345.2 -357.3
90% -2351.1 -982.1 -227.2 -230.8 -149.1 -144.5
Average -4699.7 -992.4 -1527.8 -1565.6 -450.3 -481.2
Preferred model under each case is indicated with bold.
Table 2-36. Mean AIC, true model is HNB (dispersion=2)
P(zero) Poisson NB HP ZIP HNB ZINB
10% 8901.6 1777.6 5811.7 5989.4 1334.1 1488.6
30% 11053.0 1693.9 4373.5 4476.7 1222.3 1311.5
50% 11981.6 1757.9 2965.4 3025.5 999.9 1050.3
70% 10382.6 2762.3 1721.5 1750.8 704.4 728.6
90% 4708.2 1972.2 466.4 473.7 312.3 303.0
Average 9405.4 1992.8 3067.7 3143.2 914.6 976.4
Preferred model under each case is indicated with bold.
Table 2-37. Mean Log-likelihood, true model is HNB (dispersion=4)
P(zero) Poisson NB HP ZIP HNB ZINB
10% -3198.9 -770.4 -1752.3 -1847.4 -631.2 -713.2
30% -4810.3 -801.4 -1375.4 -1430.6 -579.9 -628.5
50% -5774.9 -1010.5 -1002.8 -1034.0 -479.2 -506.8
70% -4991.6 -1250.9 -564.1 -579.6 -333.1 -346.5
90% -2397.6 -940.3 -181.6 -185.4 -148.4 -143.0
Average -4234.7 -954.7 -975.2 -1015.4 -434.3 -467.6
Preferred model under each case is indicated with bold.
Table 2-38. Mean AIC, true model is HNB (dispersion=4)
P(zero) Poisson NB HP ZIP HNB ZINB
10% 6403.9 1548.9 3516.5 3706.8 1276.5 1440.4
30% 9626.6 1610.7 2762.8 2873.2 1173.7 1271.0
50% 11555.8 2028.9 2017.6 2080.1 972.5 1027.7
70% 9989.7 2509.9 1140.2 1171.2 680.3 707.1
90% 4801.3 1888.7 375.3 382.8 310.9 300.1
Average 8475.5 1917.5 1962.5 2042.8 882.8 949.3
Preferred model under each case is indicated with bold.
Table 2-39. Mean AIC of HNB, true model is HNB
P(structural zero) D=0.5 D=1 D=2 D=4 Average
10% 1442.5 1388.3 1334.1 1276.5 1360.3
30% 1298.6 1260.8 1222.3 1173.7 1238.8
50% 1061.9 1031.5 999.9 972.4 1016.4
70% 734.7 717.5 704.4 680.3 709.2
90% 323.5 315.5 312.3 310.9 315.6
Average 972.3 942.7 914.6 882.8 928.1
Table 2-40. Relative Bias for E(Y|X), true model is HNB (dispersion=0.5)
P(zero) Poisson NB ZIP ZINB HP HNB
10% 0.967 0.963 0.196 0.478 0.031 -0.100
30% 0.960 0.956 0.270 0.493 0.007 -0.173
50% 0.951 0.947 0.321 0.506 -0.094 -0.316
70% 0.934 0.933 0.295 0.458 -0.493 -3.E+53
90% 1.261 1.267 -0.024 0.250 -168.57 -2E+106
Average 1.014 1.014 0.212 0.437 -33.82 -4E+105
Preferred model under each case is indicated with bold.
Table 2-41. Relative Bias for E(Y|X), true model is HNB (dispersion=1)
P(zero) Poisson NB ZIP ZINB HP HNB
10% 0.972 0.967 0.231 0.504 0.090 0.013
30% 0.966 0.961 0.310 0.519 0.057 -0.053
50% 0.958 0.953 0.365 0.532 -0.076 -0.183
70% 0.947 0.942 0.353 0.493 -0.414 -1.2E+40
90% 1.248 1.243 0.087 0.290 -1305.27 -1.5E+113
Average 1.018 1.013 0.269 0.468 -261.12 -3.1E+112
Preferred model under each case is indicated with bold.
Table 2-42. Relative Bias for E(Y|X), true model is HNB (dispersion=2)
P(zero) Poisson NB ZIP ZINB HP HNB
10% 0.975 0.970 0.230 0.484 0.097 0.065
30% 0.971 0.965 0.331 0.510 0.078 0.019
50% 0.963 0.957 0.390 0.526 -0.049 -0.095
70% 0.954 0.948 0.384 0.494 -0.359 -9.5E+49
90% 1.219 1.207 0.125 0.258 -8.849 -6.6E+120
Average 1.016 1.009 0.296 0.454 -1.816 -1.3E+120
Preferred model under each case is indicated with bold.
Table 2-43. Relative Bias for E(Y|X), true model is HNB (dispersion=4)
P(zero) Poisson NB ZIP ZINB HP HNB
10% 0.977 0.971 0.217 0.435 0.086 0.075
30% 0.973 0.966 0.334 0.478 0.076 0.045
50% 0.966 0.959 0.404 0.507 -0.032 -0.053
70% 0.957 0.950 0.404 0.482 -0.313 -7.3E+51
90% 1.192 1.177 0.154 0.257 -5.982 -5E+124
Average 1.013 1.004 0.302 0.432 -1.233 -1E+124
Preferred model under each case is indicated with bold.
Table 2-44. Relative Bias for E(Y|X) of HNB model, true model is HNB
P(zero) D=0.5 D=1 D=2 D=4 Average
10% -0.100 0.0138 0.065 0.075 0.013
30% -0.173 -0.053 0.019 0.045 -0.040
50% -0.316 -0.183 -0.095 -0.05 -0.161
70% -3E+53 -1E+40 -9.5E+49 -7.3E+51 -8.9E+52
90% -2E+106 -1E+113 -6E+120 -5E+124 -1E+124
Average -4E+105 -3E+112 -1E+120 -1E+124 -2E+123
Table 2-45. Observed and predicted zero observations, true model is HNB (dispersion=0.5)
P(zero) Obs Poisson NB HP ZIP HNB ZINB
10% 25 58 58 25 70 25 63
30% 75 61 86 75 102 75 97
50% 125 66 123 125 139 125 136
70% 175 73 148 174 180 174 179
90% 225 114 180 225 226 225 226
Table 2-46. Observed and predicted zero observations, true model is HNB (dispersion=1)
P(zero) Obs Poisson NB HP ZIP HNB ZINB
10% 25 53 59 25 65 25 62
30% 75 58 89 75 99 75 96
50% 125 60 125 125 138 125 136
70% 175 71 158 175 180 175 179
90% 225 116 192 225 226 225 226
Table 2-47. Observed and predicted zero observations, true model is HNB (dispersion=2)
P(zero) Obs Poisson NB HP ZIP HNB ZINB
10% 25 44 61 25 58 25 63
30% 75 47 92 75 94 75 96
50% 125 54 128 125 135 125 135
70% 175 63 165 175 179 175 179
90% 225 110 193 225 226 225 225
Table 2-48. Observed and predicted zero observations, true model is HNB (dispersion=4)
P(zero) Obs Poisson NB HP ZIP HNB ZINB
10% 25 13 22 25 24 25 24
30% 75 41 51 75 74 75 74
50% 125 73 81 125 125 125 124
70% 175 108 115 175 174 175 174
90% 225 174 186 225 225 225 225
Table 2-49. Average AIC statistics across all the models
Fitted model (rows) vs. true model (columns): ZIP HP ZINB HNB
ZIP 688.1 833.0 4342.9 4468.6
HP 699.4 746.8 4347.5 4398.9
ZINB 692.2 835.2 808.8 987.9
HNB 702.8 753.2 816.2 928.1
Table 2-50. Average Relative Bias across all the models
Fitted model (rows) vs. true model (columns): ZIP HP ZINB HNB
ZIP -0.431 0.308 -1.232 0.269
HP -1.779 -0.944 -1.172 -74.490
ZINB -0.351 0.320 -1.90E+11 0.448
HNB -2.59E+146 -1.76E+14 -4.00E+141 -2.50E+123
CHAPTER 3
A TRIPLE HURDLE COUNT DATA MODEL OF MARKET PARTICIPATION AND
CONSUMPTION
Background
In empirical economics, there has long been interest in modeling consumers’ behaviors,
in particular consumers’ preferences and purchases, and analyzing and predicting market
structure. When analyzing consumption behavior on the individual level, researchers frequently
find themselves working with count data, especially when collecting primary data using survey
instruments. Count data can be found when measuring consumption frequency or intensity
during a certain time period. Data on the frequency of purchase of food products in a given
period may provide unique challenges, as one might find many observations recorded as zero-
consumption. For example, if consumers are asked “How often did you consume blueberries last
month?”, there may be many respondents who answer that they did not consume in the past
month (hence a zero observation). Data with excess zero observations are referred to as
zero-inflated.
Though there are models to deal with zero-inflation, consumption data raise an
additional issue, as zero observations may arise for different reasons. One reason for a
zero observation is that some individuals have a non-positive desire for the product. For some
permanent reason, these individuals will not be consumers of the product (e.g., they are allergic).
However, a different reason for a zero observation is that some individuals have a positive desire
for the product, but they do not consume for some temporary reason (e.g., they may not be able to
afford the good at the given price). In this case, zero consumption is the corner solution for the
individual’s utility-maximizing decision. Similarly, some individuals might have a positive
desire to consume the product, but not during the recorded period (e.g., the past month) due to
infrequent or seasonal consumption, a common issue with fresh fruits and vegetables. Individuals
who show zero consumption in the first case are non-consumers, and these zero observations are
structural zeros. The individuals who have positive desire to consume, but were observed
consuming zero units because of the second and third reasons are potential consumers, and the
corresponding zero observations are sampling zeros. The particular interpretations given to these
zero consumption observations can have a crucial bearing on the estimation techniques, and the
interpretation of market segmentation.
In survey research, both non-consumers (those who do not have a positive participation
desire) and potential consumers (those who have a positive participation desire but choose to
consume zero units) are observed to have zero consumption and are often treated as one group in
modeling. However, the market participation decision is likely driven by a structurally
different process than the subsequent consumption and consumption intensity decisions, so
potential consumers should not be modeled in the same way as non-consumers. Analyzing the
different factors influencing consumers’ participation and consumption decisions will provide
researchers, retailers, and producers with a better understanding of consumer behaviors. This can
only be achieved by modeling these decisions separately.
Existing analyses of market participation and consumption are mostly based on a
“double-hurdle” modeling approach. The double-hurdle approach assumes that the process of
generating zero-consumption is handled separately from the process of generating positive
consumption. However, it fails to distinguish between potential consumers and non-consumers,
both of which are observed at zero consumption. Marketing strategies developed from
these models to target non-consumers might therefore affect non-consumers and
potential consumers differently. To address these limitations, this paper presents a
“triple-hurdle” count data model which allows us to observe participation intention in the
first hurdle. Conditional on the participation decision, consumers then make the subsequent
consumption and consumption intensity decisions (allowing for positive participation but zero
consumption). This model classifies three types of consumers in the market: non-consumers,
potential consumers, and consumers. It also explores the structurally different factors
explaining the three groups’ market participation, consumption, and consumption intensity
decisions in sequence.
The Econometric Modeling of Count Data for Consumption Behavior
When dealing with the problem of “excess zeros”, a variety of statistical techniques have
been proposed and applied in economic literature. One of the most widely used is the Tobit
model (Tobin, 1958). It was developed to account for the limited capacity of simple linear
regression in the presence of a preponderance of zero observations. However, the Tobit model
assumes zeros represent censored values of an underlying normally distributed latent variable
that theoretically includes negative values. This results in a restrictive model that assumes all
zero observations are structural zeros resulting from the same generating process (there is no
allowance for the possibility of sampling zeros). The model is also restrictive in assuming that
the same set of factors influences both consumers’ desire and acquisition. To address these
shortcomings, a number of generalizations of the Tobit model have been
developed.
The most popular generalizations of the Tobit model are Heckman’s sample selection
model and the double-hurdle model. When modeling consumption behavior, given the different
reasons causing zero-consumption, these models assume that individuals must pass two stages
before being observed with a positive level of consumption: a participation decision and a
consumption decision. The difference between the double-hurdle and sample selection models is
in the assumption of dominance: whether the participation decision dominates the consumption
decision.
In the double-hurdle model, positive consumption is observed
only when consumers have overcome both stages. The observed consumption variable is given
by y = d·y*, where d is the indicator for a consumer’s desire to participate, and y* is the
latent variable for the consumer’s chosen consumption level. In order to observe a positive
consumption level, both d and y* must be positive. In Heckman’s sample selection model, the
participation decision is assumed to dominate the consumption decision, which
implies that all zero observations are structural zeros, and zero consumption does not arise from
a standard corner solution. Expressed as an equation, dominance implies that
P(y* > 0 | d = 1) = 1.
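The role of the dominance assumption can be made explicit by writing out the probability of a zero observation. The short derivation below is a sketch in the notation of this section, treating d as a 0/1 indicator; it follows from the definition of y alone and is not taken from the cited papers.

```latex
\begin{align*}
y &= d \cdot y^{*}, \qquad d \in \{0, 1\} \\
\Pr(y > 0) &= \Pr(d = 1)\,\Pr(y^{*} > 0 \mid d = 1) \\
\Pr(y = 0) &= \Pr(d = 0) + \Pr(d = 1)\,\Pr(y^{*} \le 0 \mid d = 1) \\
\intertext{Under dominance, $\Pr(y^{*} > 0 \mid d = 1) = 1$, so the second term vanishes:}
\Pr(y = 0) &= \Pr(d = 0)
\end{align*}
```

Without dominance, the second term in the third line is the share of zeros that come from participants choosing zero consumption, that is, the corner-solution zeros.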
One significant problem with the Tobit model and its generalizations is that they assume
the latent variable is normally distributed, and they are very sensitive to violations of the
normality assumption (Arabmazar and Schmidt, 1982). Thus, the Tobit model and its
generalizations have significant restrictions when applied to the analysis of consumption
behavior.
When the dependent variable is count data, one of the most popular
regression techniques is Poisson regression. Poisson regression is commonly used in
economics to model the number of events, for example, the frequency of consumption. However,
the Poisson model fails to provide an adequate fit in the presence of “excess
zeros”. The Poisson model assumes that the mean equals the variance, an assumption that is
violated when excess zeros pull the mean towards zero. A number of modified Poisson regression
models have been developed to account for excess zeros, the most popular of which are
zero-inflated/modified Poisson models and Hurdle count models.
The zero-inflated Poisson (ZIP) model was proposed by Lambert (1992). This
zero-inflated count data model assumes that zero observations come from two distinct sources:
“sampling zeros” and “structural zeros.” When applied to the analysis of consumption, the
assumption is that zero consumption is recorded either when the consumer is a genuine
non-participant (structural zero), or when zero consumption is the corner solution of a standard
consumer demand problem (sampling zero).
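For reference, the standard ZIP probability mass function and its first two moments can be written as follows. The symbols π (probability of a structural zero) and λ (Poisson mean) are notation introduced here for illustration; covariates would enter through π and λ.

```latex
\begin{align*}
\Pr(Y = 0) &= \pi + (1 - \pi)\, e^{-\lambda} \\
\Pr(Y = k) &= (1 - \pi)\, \frac{e^{-\lambda} \lambda^{k}}{k!}, \qquad k = 1, 2, \dots \\
E[Y] &= (1 - \pi)\, \lambda \\
\operatorname{Var}(Y) &= (1 - \pi)\, \lambda\, (1 + \pi \lambda)
\end{align*}
```

Since Var(Y) exceeds E[Y] whenever π > 0, structural zeros by themselves produce over-dispersion relative to the Poisson model. The first line also separates the two sources of zeros: π is the structural part and (1 - π)e^{-λ} is the sampling part.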
Different from the zero-inflated count data model, the hurdle count model proposed by
Mullahy (1986) assumes that all zeros are sampling zeros. When applied to the analysis of
consumption, the assumption is that individuals need to pass two stages before being
observed with a positive level of consumption: a participation decision and a consumption
decision. Furthermore, Hurdle models assume that participation is dominant. Thus, all zero
observations are assumed to be generated in the first stage (the decision on whether to consume),
and in the second stage, the consumption count is truncated at zero.
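The contrast between the two ways of generating zeros can be sketched numerically. The functions below implement the textbook ZIP and hurdle Poisson probability mass functions for fixed parameters; the function names and the parameter values in the example are illustrative assumptions, and covariates are omitted.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing count k with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: a structural zero with probability pi,
    otherwise an ordinary Poisson draw that may itself be zero."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def hurdle_poisson_pmf(k, lam, p_zero):
    """Hurdle Poisson: all zeros come from the first stage (probability
    p_zero); positive counts follow a zero-truncated Poisson."""
    if k == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))


# Illustrative values only.
lam, p = 2.0, 0.3
print("ZIP    P(Y=0):", zip_pmf(0, lam, p))
print("Hurdle P(Y=0):", hurdle_poisson_pmf(0, lam, p))
```

With the same value plugged in for the zero parameter, the ZIP model assigns a larger total probability to zero than the hurdle model, because its count component contributes additional sampling zeros. In the hurdle model the zero probability is exactly the first-stage parameter.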
Shonkwiler and Shaw (1996) extended the Hurdle count data model by allowing zero
observations in both the first and second stages. Thus, in Shonkwiler and Shaw’s Double Hurdle
count data model, there are two mechanisms generating zero observations: zero observations can
arise either in the first stage, by choosing not to consume, or in the second stage, by choosing
to consume at zero frequency. Shonkwiler and Shaw applied the double hurdle
count data model to analyze recreation demand, and they classified people into three categories:
“user”, “potential user”, and “non-user”. They define a “user” as a person who is currently
consuming the product, a “non-user” as a person who has never consumed the product before and
likely will not consume the product in the future, and a “potential user” as a person who has
consumed the product before but is not consuming the product in the given period.
Shonkwiler and Shaw also connected the zero-inflated count
data model to the double-hurdle count data model by laying out the probability mass function for
both models, and they concluded that the two models are essentially the same: both allow
the zero observations to be generated by two separate processes, so that zero observations can be
either structural or sampling zeros.
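One way to see this equivalence is to write the double-hurdle count model with a Poisson second stage. The sketch below assumes that an individual passes the first hurdle with probability p and then draws a count, possibly zero, from a Poisson distribution with mean λ. This notation is introduced here for illustration and is not necessarily the one used by Shonkwiler and Shaw.

```latex
\begin{align*}
\Pr(Y = 0) &= (1 - p) + p\, e^{-\lambda} \\
\Pr(Y = k) &= p\, \frac{e^{-\lambda} \lambda^{k}}{k!}, \qquad k = 1, 2, \dots
\end{align*}
```

Setting π = 1 - p reproduces the zero-inflated Poisson probabilities, so under these assumptions the two models differ only in how the first-stage probability is parameterized.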
However, although these previous studies assumed that the zero observations might be
generated by two distinct sources (non-participants and potential consumers), they fail to
differentiate between the two types of consumer segments. In other words, the previous research
still assumes consumers make two subsequent decisions on consumption intention and
consumption frequency (though consumption frequency can be chosen at zero).
This study contributes to the literature by proposing a triple hurdle count data model,
which allows us to differentiate between three types of consumers (non-participants,
potential consumers, and consumers) and to explore the structurally different factors
explaining consumers’ market participation intention, consumption intention, and consumption
intensity decisions. The results of the triple hurdle count data model provide more detailed
information that can be used to classify markets into these three segments.
Motivation
The consumption of fresh produce is influenced by many different factors, which can be
stable or unstable. For example, consumers might choose not to consume certain fruits or
vegetables because of allergies, taste preferences, or diet constraints. These factors are
considered stable factors, which cause consumers to virtually ignore that type of fresh produce in
their decision making. These consumers would be expected to have a non-positive market
participation intention for the specific item of produce and are thus considered non-consumers.
However, other consumers might be influenced by unstable reasons. One significant
unstable factor for the consumption of fresh produce is seasonality. The consumption of fresh
produce can change significantly in different seasons. This can be a result of decreased supply
and availability leading to changes in prices and/or origin of producers at different times of the
year. Even though some consumers might have positive participation intention for some fruits or
vegetables, they may still choose not to consume during the off-season because of the high price.
This is referred to as the corner solution of a standard consumption problem. Consumers
influenced by seasonality (price) would change their consumption behavior when
circumstances differ, and are thus considered potential consumers.
Taking the consumption of fresh blueberries as an example, the total number of observed
zero-consumption responses per month is significantly higher in winter and lower in summer.
However, when we split the observed zero-consumption responses into non-consumers (never
purchase) and potential consumers (purchased before but not in the last month), the number of
non-consumers appears to be comparatively stable over the year, while the number of potential
consumers changes significantly over the year (Figure 3-1).
Although both non-consumers and potential consumers report zero-consumption, it
would be inappropriate to treat all zero observations the same for fresh produce consumption.
The decision of market participation appears to be driven by a structurally different process than
the subsequent consumption and consumption intensity decisions. Analyzing the
different factors influencing consumers’ participation, consumption, and consumption intensity
decisions can provide researchers, retailers, and producers a better understanding of consumer
behaviors and thus help them develop effective, separate promotion strategies targeting
non-participants, potential consumers, and current consumers. It can also be of use to policy makers
who might be interested in policies targeting increased consumption of certain healthy foods, like
fruits and vegetables.
Conceptual Framework
To develop the triple hurdle count data model, we first begin by outlining the existing
double-hurdle approach. Previous studies have theorized that the observed zero-consumption
could be driven by two different mechanisms: non-consumers (who have non-positive market
participation desire) and potential consumers (who have non-positive consumption intention
given positive participation desire). Although these studies allow for the idea that factors
influencing market participation could be different from the factors influencing consumption
decisions, they fail to observe the consumers’ actual participation desire. They are also restrictive
in assuming that the same mechanism determines both consumption intention and
consumption frequency decisions, which might not always be true.
In the triple hurdle count data model, we relax the restrictions of the double-hurdle
approach, and extend the framework to differentiate three types of consumers and allow three
different mechanisms to generate the consumers’ decisions on market participation, consumption
intention and consumption intensity.
The full triple hurdle data model specification can be represented as:
R = R (consumers’ characteristics, products’ characteristics, seasonal effect)
D = D (consumers’ characteristics, products’ characteristics, seasonal effect)
Y = Y (consumers’ characteristics, products’ characteristics, seasonal effect)
where R is a binary indicator of whether the consumer has a positive desire to participate
in the market, D is a binary indicator of whether the consumer has a positive
consumption intention in the given time period given a positive desire to participate, and Y is
a positive integer indicating consumption frequency/intensity.
Econometric Framework
In this section, we start by proposing the triple hurdle count data model with the three
stages independent of each other, then we further allow the three stages to be correlated. Next,
we outline the estimation strategy and discuss inference and interpretation of the results. We also
discuss the double hurdle approach for the purpose of comparison.
Triple Hurdle Count Data Model with Independent Stages
The triple hurdle count data model, a mixture of Poisson regression models, is an
extension of the hurdle count data model proposed by Mullahy (1984). Mullahy’s model (hurdle
count data model) included a market participation stage before the consumption stage. In the
triple hurdle model, instead, there are three stages identified: market participation, consumption
intention, and consumption intensity. Thus, the triple hurdle count data model involves three
latent equations to indicate the three stages in succession, with the first two equations having
binary outcomes indicating participation and consumption intention, and the third equation
having positive count outcome indicating consumption intensity. This splits the observations into
three regimes (non-participants, potential consumers, and consumers) that relate to potentially
three different sets of explanatory variables. Figure 3-2 is a diagram of the data generating
process involved in the triple hurdle count data model.
The model specification for the triple hurdle count data model is given in Equation 3-1
to Equation 3-3:
Market participation stage
\Pr(R^* = r) = \frac{\exp(-\theta_1)\,\theta_1^{r}}{r!}, \qquad r = 0, 1, 2, 3, \ldots
R = \begin{cases} 0 & \text{if } R^* \le 0 \\ 1 & \text{if } R^* > 0 \end{cases} \qquad (3-1)
Where R denotes the binary indicator of whether to participate or not (with R=0 for non-
participants, and R=1 for participants). R is related to a latent variable R* via the mapping: R=1
for R*>0 and R=0 for R*≤0. The latent variable R* represents the propensity for market
participation; specifically, we adopt the Poisson distribution for R* (see footnote 2). 𝜃1 is the
parameter of the Poisson distribution, which can be parameterized as 𝜃1 = exp(𝑥′𝛽), where x is a
vector of covariates and 𝛽 is a vector of unknown coefficients.
Consumption intention stage
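\Pr(D^* = d) = \frac{\exp(-\theta_2)\,\theta_2^{d}}{d!}, \qquad d = 0, 1, 2, 3, \ldots
D = \begin{cases} 0 & \text{if } D^* \le 0 \\ 1 & \text{if } D^* > 0 \end{cases} \qquad (3-2)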
Conditional on participation (R=1), consumers make the second decision on whether to
consume during a specific time period. Let D denote a second binary indicator of whether to
consume or not in the given period (with D=0 for non-consumption, and D=1 for positive
consumption), where D is also related to a latent variable D* via the mapping: D=1 for D*>0 and
D=0 for D*≤0. We also adopt the Poisson distribution for D*. 𝜃2 is the parameter for the Poisson
distribution of D*, which can be parameterized as 𝜃2 = exp (𝑧′𝛼), where z is a vector of
covariates that determine consumers’ second choice and 𝛼 is the corresponding unknown vector
of parameters. Furthermore, there is no requirement that x=z.
Consumption intensity stage
2 It is possible that R* could be a continuous variable generated by other approaches. For example, if R* were
normally distributed, then Pr (R*≤0) = Φ(-𝑥′𝛽), where Φ is the cumulative distribution function of the standard
normal distribution. However, in order to derive the triple hurdle count data model with
interdependence, we employ the Poisson regression for the latent variable R*.
\Pr(Y^* = y) = \frac{\exp(-\theta_3)\,\theta_3^{y}}{y!\,[1 - \exp(-\theta_3)]}, \qquad y = 1, 2, 3, 4, \ldots \qquad (3-3)
Conditional on consumption in the given period (D=1 and R=1), positive consumption
frequency is observed, and the consumption intensity is represented by a latent variable Y*
(Y*=1,2,3,…J) which is generated by a Poisson regression truncated at 0. 𝜃3 is the parameter for
the Poisson distribution of Y*. 𝜃3 could be parametrized as 𝜃3 = exp (𝑤′𝛾), where w is a vector
of covariates that determine consumers’ consumption intensity, and 𝛾 is the corresponding
unknown vector of parameters. In this stage, there is no requirement that w=z=x.
Accordingly, to observe a non-participant, it is required that R=0; to observe a
potential consumer, it is required jointly that the individual is a participant (R=1) who chooses
not to consume in the given period (D=0); and to observe positive consumption, we require jointly
that the individual is a participant (R=1) who chooses to consume with positive intensity
(D=1 and Y*>0).
Under the assumption that the three stages are independent, the probability of an
individual being a non-participant is:
\Pr(R = 0 \mid x) = \Pr(R^* \le 0 \mid x) = \exp(-\theta_1) \qquad (3-4)
The probability of an individual being a potential consumer is:
\Pr(R = 1, D = 0 \mid x, z) = \Pr(R^* > 0)\,\Pr(D^* \le 0) = [1 - \exp(-\theta_1)]\,\exp(-\theta_2) \qquad (3-5)
And the probability of observing positive consumption intensity, y, is:
\Pr(Y = y \mid x, z, w) = \Pr(R^* > 0)\,\Pr(D^* > 0)\,\Pr(Y^* = y)
= [1 - \exp(-\theta_1)]\,[1 - \exp(-\theta_2)]\,\frac{\exp(-\theta_3)\,\theta_3^{y}}{y!\,[1 - \exp(-\theta_3)]}, \qquad y = 1, 2, 3, \ldots \qquad (3-6)
In this way, given the independence of the three stages, the probability of observing a
non-participant is exp(-𝜃1), the probability of observing a potential consumer is
(1-exp(-𝜃1))*exp(−𝜃2), and the probability of observing a positive consumption intensity is a
combination of the three separate processes. Note that this specification differentiates zero
observations into two different regimes coming from two different generating processes. The
first process selects the individuals who have a positive participation desire, and the second
process generates the individuals who choose zero consumption given a positive participation desire.
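As a quick numerical check of Equation 3-4 to Equation 3-6, the three regime probabilities under independence should sum to one. The following Python sketch illustrates this under assumed values; the function name `independent_regime_probs` and the values of 𝜃1, 𝜃2 and 𝜃3 are hypothetical, and the zero-truncated Poisson term uses the standard mass function, including the y! factor.

```python
import math


def independent_regime_probs(theta1, theta2, theta3, y_max=200):
    """Regime probabilities of the independent triple hurdle model.

    Returns Pr(non-participant), Pr(potential consumer), and a dict that
    maps each positive count y = 1..y_max to Pr(Y = y).
    """
    p_participate = 1.0 - math.exp(-theta1)           # Pr(R* > 0)
    p_non_participant = math.exp(-theta1)             # Eq. 3-4
    p_potential = p_participate * math.exp(-theta2)   # Eq. 3-5
    p_consume = p_participate * (1.0 - math.exp(-theta2))

    log_norm = math.log1p(-math.exp(-theta3))         # log(1 - exp(-theta3))
    p_positive = {}
    for y in range(1, y_max + 1):
        # Zero-truncated Poisson mass function, evaluated on the log scale.
        log_pmf = -theta3 + y * math.log(theta3) - math.lgamma(y + 1) - log_norm
        p_positive[y] = p_consume * math.exp(log_pmf)  # Eq. 3-6
    return p_non_participant, p_potential, p_positive


# Hypothetical parameter values, chosen only for illustration.
p0, p1, p_pos = independent_regime_probs(theta1=1.2, theta2=0.8, theta3=2.0)
total = p0 + p1 + sum(p_pos.values())
print(round(p0, 4), round(p1, 4), round(total, 6))
```

The sum is truncated at `y_max`, which is harmless here because the Poisson tail beyond a few dozen counts is negligible for moderate 𝜃3.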
Once the full set of probabilities has been specified, for any given observation i, the
triple hurdle count data model has the likelihood function in Equation 3-7:
f(R_i, D_i, Y_i \mid \theta_1, \theta_2, \theta_3) = [\exp(-\theta_1)]^{1[R_i = 0]}
\times \left\{ [1 - \exp(-\theta_1)]\,[\exp(-\theta_2)]^{1[D_i = 0]} \left[ [1 - \exp(-\theta_2)]\,\frac{\exp(-\theta_3)\,\theta_3^{y_i}}{y_i!\,[1 - \exp(-\theta_3)]} \right]^{1[D_i = 1]} \right\}^{1[R_i = 1]} \qquad (3-7)
Where 𝜃1 = exp(𝑥′𝛽), and 𝛽 are the parameters on x in the first stage, 𝜃2 = exp (𝑧′𝛼),
and 𝛼 are the parameters on z in the second stage, and 𝜃3 = exp (𝑤′𝛾) with 𝛾 being the set of
parameters on 𝑤 in the third stage.
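The likelihood under independence can be sketched in a few lines of Python. This is a minimal illustration rather than the estimation code used in the study: the function names, the tuple layout of `data`, and the toy observations are all assumptions, and the truncated Poisson term again uses the standard mass function with the y! factor.

```python
import math


def loglik_contribution(r, d, y, theta1, theta2, theta3):
    """Log-likelihood of one observation under independent stages."""
    if r == 0:
        return -theta1                                  # log Pr(R = 0)
    log_participate = math.log1p(-math.exp(-theta1))    # log Pr(R = 1)
    if d == 0:
        return log_participate - theta2                 # potential consumer
    log_consume = math.log1p(-math.exp(-theta2))        # log Pr(D = 1)
    log_truncated = (-theta3 + y * math.log(theta3) - math.lgamma(y + 1)
                     - math.log1p(-math.exp(-theta3)))  # zero-truncated Poisson
    return log_participate + log_consume + log_truncated


def dot(coefs, covariates):
    """Inner product of a coefficient vector and a covariate vector."""
    return sum(c * v for c, v in zip(coefs, covariates))


def triple_hurdle_loglik(beta, alpha, gamma, data):
    """Sample log-likelihood; data holds (x, z, w, r, d, y) tuples."""
    total = 0.0
    for x, z, w, r, d, y in data:
        theta1 = math.exp(dot(beta, x))    # participation stage
        theta2 = math.exp(dot(alpha, z))   # consumption intention stage
        theta3 = math.exp(dot(gamma, w))   # consumption intensity stage
        total += loglik_contribution(r, d, y, theta1, theta2, theta3)
    return total


# Toy data with an intercept and one covariate per stage (hypothetical).
toy = [
    ((1.0, 0.2), (1.0, 0.5), (1.0, 0.1), 0, 0, 0),  # non-participant
    ((1.0, 0.7), (1.0, 0.3), (1.0, 0.4), 1, 0, 0),  # potential consumer
    ((1.0, 0.9), (1.0, 0.8), (1.0, 0.6), 1, 1, 3),  # consumer with y = 3
]
print(triple_hurdle_loglik((0.1, 0.5), (0.0, 0.4), (0.3, 0.2), toy))
```

In practice this function would be passed (with its sign flipped) to a numerical optimizer to obtain the maximum likelihood estimates of 𝛽, 𝛼 and 𝛾.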
Triple Hurdle Count Data Model with Interdependence
The assumption that the three stages are unrelated is restrictive; it is quite plausible that
the three stages are related. To accommodate this, we now extend the model to allow the three
stages to be correlated, which requires that the latent variables (R*, D*, Y*) follow a trivariate
Poisson distribution. The full observability criteria for the three types of consumers are as
follows:
A consumer is a non-participant if R=0, a potential consumer if (R=1 and D=0), and a
positive consumer with a positive consumption level y if (R=1, D=1, and Y*=y), which translates
into the following expressions for the probabilities (Equation 3-8 to Equation 3-10):
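Non-participants: \Pr(R = 0 \mid x) = \Pr(R^* = 0 \mid x) \qquad (3-8)
Potential consumers: \Pr(D = 0, R = 1 \mid x, z) = \sum_{r=1}^{\infty} \Pr(R^* = r, D^* = 0 \mid x, z) \qquad (3-9)
Positive consumption: \Pr(Y = y \mid x, z, w) = \sum_{r=1}^{\infty} \sum_{d=1}^{\infty} \Pr(R^* = r, D^* = d, Y^* = y \mid x, z, w) \qquad (3-10)
where the latent variables take the values r = 0, 1, 2, …; d = 0, 1, 2, …; and y = 1, 2, 3, ….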
Consider the trivariate Poisson distribution with two-way covariance structure,
(R*, D*, Y*) ~ TP(𝜃1, 𝜃2, 𝜃3, 𝜃12, 𝜃13, 𝜃23), which takes the form:
R^* = Z_1 + Z_{12} + Z_{13} \qquad (3-11)
D^* = Z_2 + Z_{12} + Z_{23} \qquad (3-12)
Y^* = Z_3 + Z_{13} + Z_{23} \qquad (3-13)
where Z_i ~ Po(𝜃_i), i∈{1,2,3}, and Z_ij ~ Po(𝜃_ij), i,j∈{1,2,3}, i<j, are mutually independent.
Then R* marginally follows a Poisson distribution with parameter (𝜃1 + 𝜃12 + 𝜃13), D* marginally
follows a Poisson distribution with parameter (𝜃2 + 𝜃12 + 𝜃23), and Y* marginally follows a
Poisson distribution with parameter (𝜃3 + 𝜃13 + 𝜃23). The pairs (R*, D*), (R*, Y*), and (D*, Y*)
marginally follow bivariate Poisson distributions as follows:
(𝑅∗, 𝐷∗) ~ BPoisson (𝜃1 + 𝜃13, 𝜃2 + 𝜃23, 𝜃12) with Cov(𝑅∗, 𝐷∗)= 𝜃12
(𝑅∗, 𝑌∗) ~ BPoisson (𝜃1 + 𝜃12, 𝜃3 + 𝜃23, 𝜃13) with Cov(𝑅∗, 𝑌∗) = 𝜃13
(𝐷∗, 𝑌∗) ~ BPoisson (𝜃2 + 𝜃12, 𝜃3 + 𝜃13, 𝜃23) with Cov(𝐷∗, 𝑌∗) = 𝜃23
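The common-shock construction in Equation 3-11 to Equation 3-13 is easy to simulate, which gives an informal check that the pairwise covariances equal 𝜃12, 𝜃13 and 𝜃23. The sketch below uses only the Python standard library; the sampler, the function names, and the parameter values are illustrative assumptions and are not part of the original analysis.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_trivariate_poisson(theta, theta_pair, n, seed=0):
    """Simulate (R*, D*, Y*) from independent Poisson components."""
    rng = random.Random(seed)
    th1, th2, th3 = theta
    th12, th13, th23 = theta_pair
    draws = []
    for _ in range(n):
        z1, z2, z3 = (poisson_draw(rng, t) for t in (th1, th2, th3))
        z12, z13, z23 = (poisson_draw(rng, t) for t in (th12, th13, th23))
        # Shared components induce the pairwise covariances.
        draws.append((z1 + z12 + z13, z2 + z12 + z23, z3 + z13 + z23))
    return draws


def sample_cov(a, b):
    """Unbiased sample covariance of two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


# Hypothetical values: (theta1, theta2, theta3) and (theta12, theta13, theta23).
draws = simulate_trivariate_poisson((1.0, 0.8, 2.0), (0.3, 0.2, 0.4), n=50000)
r_star, d_star, y_star = zip(*draws)
print(sample_cov(r_star, d_star), sample_cov(r_star, y_star), sample_cov(d_star, y_star))
```

With these values the three printed covariances should lie close to 0.3, 0.2 and 0.4, up to simulation noise.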
The general joint probability function of the bivariate Poisson distribution for (X, Y)
~ BP(𝜃1, 𝜃2, 𝜃0), where 𝜃0 is the covariance parameter between X and Y, is:
P(X = x, Y = y) = \exp[-(\theta_1 + \theta_2 + \theta_0)]\,\frac{\theta_1^{x}}{x!}\,\frac{\theta_2^{y}}{y!} \sum_{i=0}^{\min(x,y)} \binom{x}{i} \binom{y}{i}\, i! \left( \frac{\theta_0}{\theta_1 \theta_2} \right)^{i} \qquad (3-14)
(Johnson and Kotz, 1997)
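Equation 3-14 can be evaluated directly. The following Python sketch, with a hypothetical function name and illustrative parameter values, computes the bivariate Poisson mass function and checks numerically that it sums to one and that E[XY] - E[X]E[Y] recovers 𝜃0.

```python
import math


def bivariate_poisson_pmf(x, y, theta1, theta2, theta0):
    """Joint mass function of (X, Y) ~ BP(theta1, theta2, theta0)."""
    base = (math.exp(-(theta1 + theta2 + theta0))
            * theta1 ** x / math.factorial(x)
            * theta2 ** y / math.factorial(y))
    ratio = theta0 / (theta1 * theta2)
    series = sum(
        math.comb(x, i) * math.comb(y, i) * math.factorial(i) * ratio ** i
        for i in range(min(x, y) + 1)
    )
    return base * series


# Illustrative values; the grid is truncated where the tail is negligible.
th1, th2, th0 = 1.0, 1.5, 0.5
grid = range(40)
total = sum(bivariate_poisson_pmf(x, y, th1, th2, th0) for x in grid for y in grid)
mean_xy = sum(x * y * bivariate_poisson_pmf(x, y, th1, th2, th0)
              for x in grid for y in grid)
# Marginal means are theta1 + theta0 and theta2 + theta0.
covariance = mean_xy - (th1 + th0) * (th2 + th0)
print(round(total, 8), round(covariance, 8))
```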
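The trivariate Poisson distribution with two-way covariance structure, (R*, D*, Y*)
~ TP(𝜃1, 𝜃2, 𝜃3, 𝜃12, 𝜃13, 𝜃23), has the joint probability function:
\Pr(R^* = r, D^* = d, Y^* = y) = \exp[-(\theta_1 + \theta_2 + \theta_3 + \theta_{12} + \theta_{13} + \theta_{23})]
\times \sum_{(z_{12}, z_{13}, z_{23}) \in C} \frac{\theta_1^{r - z_{12} - z_{13}}\, \theta_2^{d - z_{12} - z_{23}}\, \theta_3^{y - z_{13} - z_{23}}\, \theta_{12}^{z_{12}}\, \theta_{13}^{z_{13}}\, \theta_{23}^{z_{23}}}{(r - z_{12} - z_{13})!\,(d - z_{12} - z_{23})!\,(y - z_{13} - z_{23})!\, z_{12}!\, z_{13}!\, z_{23}!} \qquad (3-15)
where the summation is over the set C \subset N^3 defined as
C = \{ (z_{12}, z_{13}, z_{23}) \in N^3 : z_{12} + z_{13} \le r,\ z_{12} + z_{23} \le d,\ z_{13} + z_{23} \le y \} \qquad (3-16)
(Karlis and Meligkotsidou, 2005; see footnote 3)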
Under the assumption that the three stages are interdependent, the probability of an
individual being a non-participant is
Pr(R=0|x) = Pr(R*=0|x) = exp (- (𝜃1 + 𝜃12 + 𝜃13)) (3-17)
The probability of an individual being a potential-consumer is:
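\Pr(D = 0, R = 1 \mid x, z) = \sum_{j=1}^{\infty} \Pr(R^* = j, D^* = 0)
= \exp[-(\theta_2 + \theta_{12} + \theta_{23})] - \exp[-(\theta_1 + \theta_{13}) - (\theta_2 + \theta_{23}) - \theta_{12}] \qquad (3-18)
(see footnote 4 for the derivation)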
3 In the case of the trivariate Poisson distribution with two-way covariance structure, the variance-covariance matrix
of (R*, D*, Y*) is as follows:
\begin{pmatrix}
\theta_1 + \theta_{12} + \theta_{13} & \theta_{12} & \theta_{13} \\
\theta_{12} & \theta_2 + \theta_{12} + \theta_{23} & \theta_{23} \\
\theta_{13} & \theta_{23} & \theta_3 + \theta_{13} + \theta_{23}
\end{pmatrix}
The parameters 𝜃𝑖𝑗, i,j=1,2,3, i≠j, thus have the straightforward interpretation of being the covariance between
each pair of variables.
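4 \Pr(D = 0, R = 1) = \sum_{j=1}^{\infty} \Pr(R^* = j, D^* = 0)
= \sum_{j=0}^{\infty} \Pr(R^* = j, D^* = 0) - \Pr(R^* = 0, D^* = 0)
= \Pr(D^* = 0) - \Pr(R^* = 0, D^* = 0)
= \exp[-(\theta_2 + \theta_{12} + \theta_{23})] - \exp[-(\theta_1 + \theta_{13}) - (\theta_2 + \theta_{23}) - \theta_{12}]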
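And the probability of an individual being observed with positive consumption intensity
y is
\Pr(Y = y \mid x, z, w) = \sum_{j=1}^{\infty} \sum_{k=1}^{\infty} \Pr(R^* = j, D^* = k, Y^* = y)
= \frac{\exp[-(\theta_3 + \theta_{13} + \theta_{23})]\,(\theta_3 + \theta_{13} + \theta_{23})^{y}}{y!}
- \frac{\exp[-(\theta_1 + \theta_{12}) - (\theta_3 + \theta_{23}) - \theta_{13}]\,(\theta_3 + \theta_{23})^{y}}{y!}
- \frac{\exp[-(\theta_2 + \theta_{12}) - (\theta_3 + \theta_{13}) - \theta_{23}]\,(\theta_3 + \theta_{13})^{y}}{y!}
+ \frac{\exp[-(\theta_1 + \theta_2 + \theta_3 + \theta_{12} + \theta_{13} + \theta_{23})]\,\theta_3^{y}}{y!} \qquad (3-19)
where y = 1, 2, 3, … (see footnote 5).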
Thus, under the assumption of interdependence of the three stages, the probability of
observing a non-participant is exp (- (𝜃1 + 𝜃12 + 𝜃13)) and the probability of observing a
potential-consumer is exp(-(𝜃2 + 𝜃12 + 𝜃23))– exp(−(𝜃1 + 𝜃13) −( 𝜃2 + 𝜃23)−𝜃12).
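The regime probabilities under interdependence follow from inclusion-exclusion on the events R* = 0 and D* = 0. The Python sketch below is an illustration under assumed parameter values, with hypothetical function names; Pr(Y* = y) is evaluated as the Poisson mass function of the marginal distribution Y* ~ Po(𝜃3 + 𝜃13 + 𝜃23).

```python
import math


def poisson_pmf(y, lam):
    """Poisson mass function evaluated on the log scale."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def interdependent_regime_probs(th1, th2, th3, th12, th13, th23, y):
    """Return Pr(R=0), Pr(R=1, D=0) and Pr(R*>0, D*>0, Y*=y)."""
    p_r0 = math.exp(-(th1 + th12 + th13))                 # Pr(R* = 0)
    p_d0 = math.exp(-(th2 + th12 + th23))                 # Pr(D* = 0)
    p_r0_d0 = math.exp(-(th1 + th2 + th12 + th13 + th23)) # Pr(R* = 0, D* = 0)

    p_non_participant = p_r0
    p_potential = p_d0 - p_r0_d0

    # Inclusion-exclusion over the events {R* = 0} and {D* = 0}.
    p_y = poisson_pmf(y, th3 + th13 + th23)               # Pr(Y* = y)
    p_y_r0 = p_r0 * poisson_pmf(y, th3 + th23)            # Pr(Y* = y, R* = 0)
    p_y_d0 = p_d0 * poisson_pmf(y, th3 + th13)            # Pr(Y* = y, D* = 0)
    p_y_r0_d0 = p_r0_d0 * poisson_pmf(y, th3)             # all three events
    p_positive = p_y - p_y_r0 - p_y_d0 + p_y_r0_d0
    return p_non_participant, p_potential, p_positive


# Hypothetical parameter values, chosen only for illustration.
params = (1.0, 0.8, 2.0, 0.3, 0.2, 0.4)
for count in (1, 2, 3):
    print(count, interdependent_regime_probs(*params, y=count))
```

Setting 𝜃12 = 𝜃13 = 𝜃23 = 0 in this function reduces the third probability to the product of (1-exp(-𝜃1)), (1-exp(-𝜃2)) and an untruncated Poisson mass at y.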
Considering the likelihood function, the parameters are redefined as
(𝜃1, 𝜃2, 𝜃3, 𝜃12, 𝜃13, 𝜃23) in the case of interdependence, where 𝜃12, 𝜃13, 𝜃23 are the covariance
parameters between each pair of stages. A Wald test of 𝜃𝑖𝑗 = 0, i,j∈{1,2,3}, i<j, will be employed
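5 In order to derive Pr(Y=y), we employ the marginal distribution Y* ~ Po(𝜃3 + 𝜃13 + 𝜃23) and the marginal
distributions (R*, Y*) ~ BPoisson(𝜃1 + 𝜃12, 𝜃3 + 𝜃23, 𝜃13) and (D*, Y*) ~ BPoisson(𝜃2 + 𝜃12, 𝜃3 + 𝜃13, 𝜃23):
\Pr(Y = y) = \sum_{j=1}^{\infty} \sum_{k=1}^{\infty} \Pr(R^* = j, D^* = k, Y^* = y)
= \Pr(Y^* = y) - \Pr(Y^* = y, R^* = 0) - \Pr(Y^* = y, D^* = 0) + \Pr(Y^* = y, D^* = 0, R^* = 0)
= \frac{\exp[-(\theta_3 + \theta_{13} + \theta_{23})]\,(\theta_3 + \theta_{13} + \theta_{23})^{y}}{y!}
- \frac{\exp[-(\theta_1 + \theta_{12}) - (\theta_3 + \theta_{23}) - \theta_{13}]\,(\theta_3 + \theta_{23})^{y}}{y!}
- \frac{\exp[-(\theta_2 + \theta_{12}) - (\theta_3 + \theta_{13}) - \theta_{23}]\,(\theta_3 + \theta_{13})^{y}}{y!}
+ \frac{\exp[-(\theta_1 + \theta_2 + \theta_3 + \theta_{12} + \theta_{13} + \theta_{23})]\,\theta_3^{y}}{y!}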
to test for the independence between each pair of stages. The likelihood function under
interdependence is as follows in Equation 3-20
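f(R_i, D_i, Y_i \mid \theta_1, \theta_2, \theta_3, \theta_{12}, \theta_{13}, \theta_{23}) =
\{ \exp[-(\theta_1 + \theta_{12} + \theta_{13})] \}^{1[R_i = 0]}
\times \Big( \{ \exp[-(\theta_2 + \theta_{12} + \theta_{23})] - \exp[-(\theta_1 + \theta_{13}) - (\theta_2 + \theta_{23}) - \theta_{12}] \}^{1[D_i = 0]}
\times \Big\{ \frac{\exp[-(\theta_3 + \theta_{13} + \theta_{23})]\,(\theta_3 + \theta_{13} + \theta_{23})^{y_i}}{y_i!}
- \frac{\exp[-(\theta_1 + \theta_{12}) - (\theta_3 + \theta_{23}) - \theta_{13}]\,(\theta_3 + \theta_{23})^{y_i}}{y_i!}
- \frac{\exp[-(\theta_2 + \theta_{12}) - (\theta_3 + \theta_{13}) - \theta_{23}]\,(\theta_3 + \theta_{13})^{y_i}}{y_i!}
+ \frac{\exp[-(\theta_1 + \theta_2 + \theta_3 + \theta_{12} + \theta_{13} + \theta_{23})]\,\theta_3^{y_i}}{y_i!} \Big\}^{1[D_i = 1]} \Big)^{1[R_i = 1]} \qquad (3-20)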
Where 𝜃1 = exp(𝑥′𝛽), and 𝛽 are the parameters on x in the first stage, 𝜃2 = exp (𝑧′𝛼),
and 𝛼 are the parameters on z in the second stage, and 𝜃3 = exp (𝑤′𝛾) with 𝛾 being the set of
parameters on 𝑤 in the third stage.
Furthermore, from the likelihood model, we also calculate the expected probability of
observing different levels of consumption: the probability of observing a non-consumer is
expressed in Equation 3-21; and the probability of observing a potential consumer is expressed in
Equation 3-22
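\Pr(\text{Non-consumer}) = \Pr(R_i = 0 \mid x_i) = \exp[-(\theta_{1i} + \theta_{12} + \theta_{13})] \qquad (3-21)
\Pr(\text{Potential consumer}) = \Pr(D_i = 0, R_i = 1 \mid x_i, z_i)
= \exp[-(\theta_{2i} + \theta_{12} + \theta_{23})] - \exp[-(\theta_{1i} + \theta_{13}) - (\theta_{2i} + \theta_{23}) - \theta_{12}] \qquad (3-22)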
Marginal Effects and Interpreting Results
The overall effect of a given explanatory variable is determined by several different sets
of marginal effects. For example, marginal effects of an explanatory variable can be determined
on the probability of being a non-consumer, Pr(R=0), on the probability of being a potential
consumer, and on the probabilities of different levels of consumption, Pr(Y=j). Calculating
marginal effects for each decision stage allows for comparisons between non-consumers and
potential consumers (which has been lacking from previous models).
The marginal effect of a dummy variable is calculated as the difference between the
probabilities when the dummy variable equals 1 and when it equals 0. For continuous variables, the
marginal effects are obtained as derivatives of the probability expressions provided for each
consumer category. Note that the explanatory variables in the three stages might not be the
same. Thus the explanatory variable of interest may appear in only one or two of x, z and w, or
in all of them. For a continuous variable 𝑥𝑘, the marginal effect on the probability of
non-participation, which only relates to the explanatory variables in x, is given by Equation 3-23:
ME_{\Pr(R=0)} = \frac{\partial \Pr(R = 0)}{\partial x_k} = \exp[-(\theta_1 + \theta_{12} + \theta_{13})]\,(-\theta_1)\,\beta_k \qquad (3-23)
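An analytic marginal effect such as Equation 3-23 can be sanity-checked against a central finite difference of the underlying probability. The sketch below does this in Python for hypothetical covariate values, coefficients, and covariance parameters; none of the numbers come from the estimated model.

```python
import math


def prob_non_participant(x, beta, th12, th13):
    """Pr(R = 0 | x) = exp(-(theta1 + theta12 + theta13)), theta1 = exp(x'beta)."""
    theta1 = math.exp(sum(b * v for b, v in zip(beta, x)))
    return math.exp(-(theta1 + th12 + th13))


def me_analytic(x, beta, th12, th13, k):
    """Analytic marginal effect of x_k on Pr(R = 0), as in Equation 3-23."""
    theta1 = math.exp(sum(b * v for b, v in zip(beta, x)))
    return math.exp(-(theta1 + th12 + th13)) * (-theta1) * beta[k]


def me_numeric(x, beta, th12, th13, k, step=1e-6):
    """Central finite-difference approximation of the same derivative."""
    x_up = list(x)
    x_down = list(x)
    x_up[k] += step
    x_down[k] -= step
    upper = prob_non_participant(x_up, beta, th12, th13)
    lower = prob_non_participant(x_down, beta, th12, th13)
    return (upper - lower) / (2.0 * step)


# Hypothetical inputs: an intercept plus one continuous covariate.
x_obs = [1.0, 0.5]
beta_hat = [0.1, 0.4]
print(me_analytic(x_obs, beta_hat, 0.3, 0.2, k=1))
print(me_numeric(x_obs, beta_hat, 0.3, 0.2, k=1))
```

The two printed values should agree to several decimal places; a visible gap would point to an algebra error in the analytic expression.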
To derive the marginal effects on the overall probabilities for the triple hurdle
count data model with interdependence, we need to partition the explanatory variables and the
associated coefficients as follows, given that a variable may appear in only one or two of x, z and w:
x' = (u', \tilde{x}'), \qquad \beta' = (\beta_u', \tilde{\beta}')
z' = (u', \tilde{z}'), \qquad \alpha' = (\alpha_u', \tilde{\alpha}')
w' = (u', \tilde{w}'), \qquad \gamma' = (\gamma_u', \tilde{\gamma}')
where u represents the common variables that appear in all of x, z, and w, with
associated coefficients 𝛽𝑢, 𝛼𝑢 and 𝛾𝑢 for the participation intention, consumption intention,
and consumption frequency equations, respectively. \tilde{x} denotes the distinctive variables that only
appear in the participation stage, with \tilde{\beta} as the associated coefficients; similarly, \tilde{z} and \tilde{w} denote
the variables that only appear in the consumption decision stage and the consumption frequency
stage, with \tilde{\alpha} and \tilde{\gamma} as the associated coefficients, respectively.
In order to express the marginal effects for the entire model, the unique explanatory
variables are stacked as x^{*\prime} = (u', \tilde{x}', \tilde{z}', \tilde{w}'), and the associated coefficient vectors for the
three stages are expressed as \beta^{*\prime} = (\beta_u', \tilde{\beta}', 0', 0'), \alpha^{*\prime} = (\alpha_u', 0', \tilde{\alpha}', 0'), and
\gamma^{*\prime} = (\gamma_u', 0', 0', \tilde{\gamma}').
The marginal effect of the explanatory variable vector 𝑥∗ on the probability of being a
potential consumer, which relates to the explanatory variables in both x and z, is given in Equation 3-24:
ME_{\Pr(D=0, R=1 \mid x, z)} = \exp[-(\theta_2 + \theta_{12} + \theta_{23})]\,(-\theta_2)\,\alpha^*
- \exp[-(\theta_1 + \theta_2 + \theta_{12} + \theta_{13} + \theta_{23})]\,(-\theta_1 \beta^* - \theta_2 \alpha^*) \qquad (3-24)
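The marginal effect of the explanatory variables on the probability of a positive level of
consumption y (y=1,2,…) is given in Equation 3-25:
ME_{\Pr(Y=y \mid x, z, w)}
= \frac{\exp[-(\theta_3 + \theta_{13} + \theta_{23})]}{y!} \left[ y\,(\theta_3 + \theta_{13} + \theta_{23})^{y-1} - (\theta_3 + \theta_{13} + \theta_{23})^{y} \right] \theta_3 \gamma^*
- \frac{\exp[-(\theta_1 + \theta_{12}) - (\theta_3 + \theta_{23}) - \theta_{13}]}{y!} \left[ y\,(\theta_3 + \theta_{23})^{y-1}\,\theta_3 \gamma^* - (\theta_3 + \theta_{23})^{y}\,(\theta_1 \beta^* + \theta_3 \gamma^*) \right]
- \frac{\exp[-(\theta_2 + \theta_{12}) - (\theta_3 + \theta_{13}) - \theta_{23}]}{y!} \left[ y\,(\theta_3 + \theta_{13})^{y-1}\,\theta_3 \gamma^* - (\theta_3 + \theta_{13})^{y}\,(\theta_2 \alpha^* + \theta_3 \gamma^*) \right]
+ \frac{\exp[-(\theta_1 + \theta_2 + \theta_3 + \theta_{12} + \theta_{13} + \theta_{23})]}{y!} \left[ y\,\theta_3^{y-1}\,\theta_3 \gamma^* - \theta_3^{y}\,(\theta_1 \beta^* + \theta_2 \alpha^* + \theta_3 \gamma^*) \right] \qquad (3-25)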
The marginal effects for the triple hurdle count data model with no interdependence are
calculated as above but with 𝜃13 = 𝜃23 = 𝜃12 = 0.
The standard errors of the marginal effects could be calculated by the Delta Method or by
simulated asymptotic sampling techniques. Considering the complexity of the marginal effects,
the sampling technique is used in this case. To be more specific, we randomly draw θ (where θ is
the vector of parameters in the triple hurdle count data model) from MVN(θ̂, var̂[θ̂]) 10,000
times; for each draw we calculate the marginal effects based on Equation 3-23 to Equation 3-25,
and we then take the empirical standard deviations of the simulated marginal effects. These
empirical standard deviations are valid asymptotic estimates of the standard errors of the true
marginal effects.
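The sampling procedure can be sketched as follows in Python. This is a simplified illustration, not the code used for the reported results: only the participation coefficients are drawn, the covariance parameters are held fixed, and the estimates, covariance matrix, and function names are hypothetical.

```python
import math
import random
import statistics


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    size = len(matrix)
    lower = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def simulated_se(estimate, covariance, effect_fn, draws=10000, seed=0):
    """Standard deviation of effect_fn over draws from MVN(estimate, covariance)."""
    rng = random.Random(seed)
    lower = cholesky(covariance)
    size = len(estimate)
    effects = []
    for _ in range(draws):
        shocks = [rng.gauss(0.0, 1.0) for _ in range(size)]
        params = [
            estimate[i] + sum(lower[i][k] * shocks[k] for k in range(i + 1))
            for i in range(size)
        ]
        effects.append(effect_fn(params))
    return statistics.stdev(effects)


def me_non_participation(params, x=(1.0, 0.5), th12=0.3, th13=0.2, k=1):
    """Equation 3-23 at covariates x; th12 and th13 are held fixed here."""
    theta1 = math.exp(sum(b * v for b, v in zip(params, x)))
    return math.exp(-(theta1 + th12 + th13)) * (-theta1) * params[k]


beta_hat = [0.1, 0.4]                     # hypothetical estimates
beta_cov = [[0.04, 0.01], [0.01, 0.09]]   # hypothetical covariance matrix
print(simulated_se(beta_hat, beta_cov, me_non_participation))
```

A full implementation would draw the complete parameter vector, including 𝜃12, 𝜃13 and 𝜃23, from its estimated joint distribution.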
Comparing Triple Hurdle Count Data Model and the Double Hurdle Models
One goal of this study is to discuss the difference in insights gained when differentiating
potential consumers from non-consumers, and employing the triple hurdle model instead of the
double-hurdle approach.
The double-hurdle alternative is similar to the model proposed by Shonkwiler and Shaw
(1996), which assumes that the factors influencing the consumption frequency decision are the same
as the factors influencing the consumption intention.
In Shonkwiler and Shaw’s model, the probability of observing a non-participant is:
\Pr(D = 0 \mid x) = \Pr(D^* = 0 \mid x) = \exp[-(\theta_1 + \theta_{12})] \qquad (3-26)
and the probability of observing an individual who is a potential consumer is:
\Pr(D = 1, Y = 0 \mid x, w) = \sum_{j=1}^{\infty} \Pr(D^* = j, Y^* = 0) = \exp[-(\theta_2 + \theta_{12})] - \exp(-\theta_1 - \theta_2 - \theta_{12}) \qquad (3-27)
The probability of observing positive consumption frequency is:
\Pr(D = 1, Y = y \mid x, w) = \sum_{j=1}^{\infty} \Pr(D^* = j, Y^* = y)
= \frac{\exp[-(\theta_2 + \theta_{12})]\,(\theta_2 + \theta_{12})^{y}}{y!} - \frac{\exp(-\theta_1 - \theta_2 - \theta_{12})\,\theta_2^{y}}{y!} \qquad (3-28)
Where 𝜃1 = exp(𝑥′𝛽), 𝛽 is the vector of parameters on x in the market participation stage, and
𝜃2 = exp(𝑤′𝛾), with 𝛾 being the vector of parameters on 𝑤 in the consumption stage. In this case, there
is an assumption that the probability of consumption intention and the probability of
consumption frequency are related to the same explanatory factors (𝑤) in similar ways.
Although this double-hurdle approach is non-nested with the triple hurdle model, a
generalized likelihood ratio (LR) statistic could be used, with degrees of freedom given by
the number of additional parameters estimated in the more general model. Additionally, in such a
non-nested situation, information-based model selection criteria such as AIC and BIC are
appropriate for choosing between alternative models. These are given by AIC = -2 ln L(θ̂) + 2k and
BIC = -2 ln L(θ̂) + k ln N, where k is the total number of parameters estimated, N is the number of
observations, and ln L(θ̂) is the maximized log-likelihood. The preferred model is the one with
the smallest value.
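For concreteness, the two criteria can be computed as below, using the standard definitions AIC = -2 ln L + 2k and BIC = -2 ln L + k ln N. The log-likelihood values, parameter counts, and sample size in this Python sketch are hypothetical placeholders, not the estimates reported in this chapter.

```python
import math


def aic(loglik, n_params):
    """Akaike information criterion: -2 ln L + 2k."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: -2 ln L + k ln N."""
    return -2.0 * loglik + n_params * math.log(n_obs)


# Hypothetical fitted models: (maximized log-likelihood, number of parameters).
models = {"double hurdle": (-5200.0, 30), "triple hurdle": (-5100.0, 45)}
n_obs = 4000  # hypothetical sample size

for name, (loglik, k) in models.items():
    print(name, round(aic(loglik, k), 1), round(bic(loglik, k, n_obs), 1))

# The model with the smallest criterion value is preferred.
best_by_bic = min(models, key=lambda m: bic(models[m][0], models[m][1], n_obs))
print(best_by_bic)
```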
Variables and Data
Data Set
In this chapter, the triple hurdle count data model was fit using data from an online survey about
consumers’ consumption behavior and preferences for fresh blueberries. The survey was
conducted with a random panel of respondents; it started in September 2010 and lasted for 12
months, with approximately 350 participants recruited each month. The target
respondents were primary grocery shoppers in the eastern United States. Respondents
answered a series of questions on how often and why (or why not) they purchase fresh
blueberries.
Here, for modeling purposes, non-consumers and potential consumers were distinguished
using survey design. Respondents were first asked whether they had ever purchased fresh
blueberries and then asked whether they had purchased fresh blueberries in the past month.
Respondents who had purchased in the previous month were further asked to indicate
how many times they purchased fresh blueberries in the past month. Purchase information was
only asked for the past month to ensure accuracy of the data as it is difficult for people to recall
purchases more than one month ago. By asking respondents whether they have purchased fresh
blueberries before and whether they had purchased fresh blueberries last month in two questions,
“non-consumers” and “potential consumers” can be differentiated according to the definition of
the three types of consumers given above.
The questionnaire consisted of four parts. The first part included questions concerning
consumers’ frequency of consuming fresh blueberries. The second part focused on the reasons
for consuming (or not consuming) fresh blueberries. The third part of the questionnaire focused
on the consumers’ awareness of health benefits of eating fresh blueberries, and the last part
included socio-demographic variables, such as gender, age, educational level, employment status,
family size, and socioeconomic status.
Variables
The key dependent variables are ANYPARTICIPATE, ANYCONSUME, and
PURCHASEFREQ. The three variables are derived from the following questions, respectively:
“Have you ever purchased fresh blueberries?” (where a binary “Yes/No” answer is required);
“Have you purchased fresh blueberries in the LAST MONTH?”; and “In the last month,
approximately how many times did you purchase fresh blueberries?”. For the final question,
respondents selected an answer from the categories 1 or 2 times; 3 or 4 times; 5 or 6 times; more
than 6 times; and did not purchase (though they were not shown the question if they indicated no
purchase, this was used as a consistency check). Thus, the use of these three dependent variables
corresponds to examining a three-step decision made with respect to participation and
consumption. As noted earlier, asking the three questions in sequence allows identification
of non-participants, potential consumers, and consumers.
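The mapping from the three survey answers to the three regimes can be written down explicitly. The Python sketch below is only illustrative: the function name is hypothetical, and the ordinal coding of the frequency categories is an assumption, since the text does not state how the categories were converted to counts.

```python
# Assumed ordinal coding of the frequency categories (not stated in the text).
FREQUENCY_CODES = {
    "1 or 2 times": 1,
    "3 or 4 times": 2,
    "5 or 6 times": 3,
    "more than 6 times": 4,
}


def classify_respondent(ever_purchased, purchased_last_month, frequency=None):
    """Map the survey answers to (regime, R, D, Y).

    ever_purchased       -> ANYPARTICIPATE (R)
    purchased_last_month -> ANYCONSUME (D)
    frequency            -> PURCHASEFREQ category (Y), consumers only
    """
    if not ever_purchased:
        return ("non-participant", 0, None, None)
    if not purchased_last_month:
        return ("potential consumer", 1, 0, None)
    return ("consumer", 1, 1, FREQUENCY_CODES[frequency])


print(classify_respondent(False, False))
print(classify_respondent(True, False))
print(classify_respondent(True, True, "3 or 4 times"))
```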
The covariates employed in the model are shown in Table 3-1, together with their means.
In addition, descriptions of each variable, and whether the variable was employed in the
participation stage (P), consumption intention stage (C), and consumption frequency stage (F)
are also indicated in this table.
The individual characteristics include gender, education level, race, age and awareness of
health benefits of blueberries. In this dataset, only 35.7% of the respondents are male, which was
expected as only primary grocery shoppers for the household completed the survey. Education
level is controlled for with a binary dummy variable indicating whether the respondents have a
four-year college degree or not (40.6% of the participants have earned at least an undergraduate
degree). Consumers’ awareness of health benefits of blueberries was controlled for by using a
dummy variable, which allows us to test the effectiveness of knowledge of health benefits on
consumption decisions. In this dataset, 51.9% of participants indicated that they were aware of
specific health benefits of blueberries.
Together with individual characteristics, household characteristics are also controlled for,
including the number of people in the household, whether there are children living in the
household, household income, and household food budget per week. Both household income and
household food budget per week are included based on previous research indicating that income
level works as a social class proxy for consumption participation, while food budget more
directly influences consumption frequency.
The last set of variables is a ranking of how important the respondent finds different
attributes of blueberries, including price and taste. Since the consumption of blueberries changes
significantly over the year, we also include seasonal dummy variables.
Results
The estimated probabilities of the different types of consumers of fresh blueberries
from both the double-hurdle and triple-hurdle approaches are presented in Table
3-2. The predicted probability of non-consumers from the triple-hurdle model
is 21.26% (compared to the observed percent of 20.06%), and that from the double-hurdle
model is 19.83%. The predicted probability of potential consumers from the triple-hurdle
model is 27.03% (compared to the observed percent of 26.84%), and that from the double-hurdle
approach is 23.60%. The predicted probability of consumers from the triple-hurdle
model is 51.71% (compared to the observed percent of 53.10%), and that from the double-hurdle
approach is 56.67%. This indicates that the triple-hurdle approach predicts the shares of
potential consumers and consumers more closely than the double-hurdle approach; specifically,
the double-hurdle approach tends to underestimate the percentage of potential
consumers in the market.
Summary statistics from both the Double hurdle (DH) model and the Triple hurdle (TH)
model are presented in Table 3-3. The DH model is conditional only on X and W, which forces
the assumption that the same set of explanatory factors influence consumption intention and
consumption frequency. The TH model is conditional on X, Z, and W, which allows potential
consumers to be differentiated from consumers. The likelihood ratio statistics from the models of
fresh blueberry consumption reject the DH model. Furthermore, both the AIC and BIC
information criteria suggest the superiority of the TH model over the DH. We, therefore, focus
the discussion on results from the triple hurdle count data model but include discussion on the
insights gained from using the triple hurdle approach compared to the double hurdle approach.
Regression Results of the Triple Hurdle Count Data Model
Triple-hurdle count data model estimation results are displayed in Table 3-4 together with
the estimation results from the double-hurdle approach. Coefficient estimates for factors
associated with the probability of having a positive market participation intention (Stage 1) are
displayed in Column 1; coefficient estimates for factors associated with positive consumption
intention are displayed in Column 2; and coefficient estimates for factors associated with
positive consumption levels given positive market participation intention and consumption
intention are shown in Column 3.
First, focusing on the demographic characteristics, some were found to have consistent
results in both the DH and TH models. For example, both models indicate that females are
likely to purchase fresh blueberries more frequently than males, and no significant effect of
gender is detected on market participation or purchase intention. Age is significantly
negatively correlated with market participation intention, consumption intention, and
consumption intensity in the triple hurdle model, which indicates that younger people are more
likely to consume fresh blueberries and to consume them more frequently; the double hurdle
model yields similar results. Weekly food budget is significantly positive across all three
stages of the triple hurdle model, and similar results are found in the double hurdle model,
indicating that people with higher food budgets are more likely to try fresh blueberries, more
likely to purchase blueberries during grocery shopping, and likely to purchase fresh
blueberries more often.
Results differed between the two models for other demographic variables. Results from
the DH model indicate that Caucasians are more likely to try fresh blueberries compared with other
races, but are likely to purchase fresh blueberries less often. The first part remains the same in
the TH model, with Caucasians more likely to try fresh blueberries. However, the TH model
finds that Caucasians are less likely to purchase blueberries, with no relationship to consumption
frequency. Moreover, results from the double hurdle model indicate that Hispanics are less
likely to purchase fresh blueberries, while the triple hurdle model indicates that Hispanics
are less likely to participate in the fresh blueberry market (and thus have a higher probability of
being observed as non-consumers). The variable for Hispanic was not significantly related to
consumption intention or consumption frequency decisions in the TH model. Education was not
significantly related to market participation or consumption frequency in the TH model, and was
negatively related to consumption intention. In the DH model, education was positively related
to participation, indicating that those with more education were more likely to purchase fresh
blueberries. Income is statistically significant and positive in the participation stage of the
double hurdle model. However, no significant correlation between household income level and
fresh blueberry consumption is detected in any of the three stages of the triple hurdle model.
Considering consumers’ food habits, being vegetarian was found to be significantly and
positively correlated with fresh blueberry purchase in the double hurdle model; in the TH model,
it was significantly positively correlated with market participation intention but negatively
correlated with consumption intention.
Turning to household characteristics, the double hurdle model indicates that families
with children are more likely to purchase fresh blueberries, but no more likely than others to
purchase them more frequently. This is similar to the TH model, which finds that consumers
who indicate they have children living in the household are more likely to participate in the fresh
blueberry market, yet shows no relationship with consumption intention or purchase
frequency. The distinction is that the TH model locates the relationship in the
participation decision rather than the consumption decision, while both models agree that there
is no relationship with consumption frequency. The number of people living in the household is
not significant in either stage of the double hurdle model, whereas in the triple hurdle model
its estimated coefficient is significantly negative for market participation, indicating that
larger households are less likely to participate in the market.
The estimated results indicate that consumers who are aware of the health benefits of
blueberries are significantly more likely to purchase fresh blueberries, and to purchase them at
a higher frequency, in the double hurdle model. In the triple hurdle model, consumers’
awareness of health benefits significantly influences the market participation decision but not
consumption intention. Similar to the DH model, the TH model also indicates that consumers who
are aware of health benefits purchase fresh blueberries more frequently.
Considering blueberry characteristics, results from the double hurdle model indicate that
participants who report that taste is an important factor in their purchase decision are more
likely to purchase fresh blueberries and more likely to purchase them at a high
frequency. However, the triple hurdle model shows that taste is significantly
correlated only with positive consumption intention and higher consumption frequency, not with
market participation. As for price, the double hurdle model results indicate that
participants who consider price important are less likely to purchase fresh blueberries and
purchase them less frequently. In the TH model, results show that consumers who
consider price important are less likely to have consumption intention for fresh blueberries and
are likely to purchase fresh blueberries less frequently, yet there is no significant influence on
consumers’ market participation intention. Moreover, the magnitude of the price coefficient is
much larger in the consumption intention stage than in the consumption frequency stage.
Finally, considering seasonality (the domestic blueberry season in
the United States runs from April to late September), the results from the double hurdle model
indicate that consumers are more likely to purchase fresh blueberries at a higher frequency in
Summer and Fall than in Winter, with no significant difference in consumption frequency
between Spring and Winter. Purchase intention is affected only in Spring, when consumers
are less likely to purchase fresh blueberries. In the triple hurdle model, consumers are more
likely to have positive market participation intention and purchase intention, and to purchase
at a higher frequency, in Summer than in Winter. In Fall, although consumers are
more likely to participate in the market and to purchase fresh blueberries, there is no significant
effect on consumption frequency. In Spring, consumers are less likely to purchase fresh
blueberries than in Winter, with no significant effect on market participation intention or
consumption frequency. With the TH model, it can be observed that consumers are more likely
to purchase, and to purchase at a higher frequency, during the peak season. From the DH model,
we see an increase in consumption frequency during peak season, but no impact on participation.
Marginal Effects of the Triple Hurdle Count Data Model
As previously mentioned, one of the advantages of the Triple Hurdle Count Data model
is that it introduces a degree of flexibility and explores the different generating processes of
the three types of consumers. This is most easily demonstrated by examining the marginal effects
of different variables. Marginal effects on zero observations from the triple hurdle count data
model, compared with those from the double hurdle count data model, are shown in Table 3-5.
For the triple hurdle model, the overall marginal effect on Pr(y=0) is divided into two parts:
the effect on non-participation, Pr(r=0), and the effect on participation with zero
consumption, Pr(r=1, d=0). Table 3-6 shows the marginal effects on the unconditional
probabilities of positive levels of consumption (y=1, 2, 3, 4) for the triple hurdle model
versus the double hurdle model.
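The split of the zero probability into its two components can be illustrated with a short sketch. The code below assumes, purely for illustration, that the participation and consumption-intention stages are independent probits (the estimated model allows correlated errors, which this sketch ignores). The index values and coefficients are hypothetical and are not estimates from Table 3-4. It computes Pr(r=0) and Pr(r=1, d=0), then approximates the marginal effect of a binary regressor as the discrete change in each probability.

```python
import math


def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def zero_probabilities(index_r, index_d):
    """Split Pr(y=0) into non-participation and participation-with-zero parts.

    Assumes independent probit stages (an illustrative simplification).
    """
    p_participate = norm_cdf(index_r)               # Pr(r=1)
    p_intend = norm_cdf(index_d)                    # Pr(d=1 | r=1)
    p_nonparticipant = 1.0 - p_participate          # Pr(r=0)
    p_potential = p_participate * (1.0 - p_intend)  # Pr(r=1, d=0)
    return p_nonparticipant, p_potential


def discrete_marginal_effect(base_r, base_d, beta_r, beta_d):
    """Change in each zero component when a dummy switches from 0 to 1."""
    non0, pot0 = zero_probabilities(base_r, base_d)
    non1, pot1 = zero_probabilities(base_r + beta_r, base_d + beta_d)
    return non1 - non0, pot1 - pot0


# Hypothetical baseline indices and dummy coefficients (not estimates).
# The dummy enters only the consumption-intention stage (beta_r = 0).
d_non, d_pot = discrete_marginal_effect(base_r=0.8, base_d=0.5,
                                        beta_r=0.0, beta_d=-0.4)
total_effect_on_zero = d_non + d_pot
print(f"dPr(r=0) = {d_non:.3f}, dPr(r=1,d=0) = {d_pot:.3f}, "
      f"dPr(y=0) = {total_effect_on_zero:.3f}")
```

In this sketch a regressor that enters only the consumption-intention stage shifts Pr(r=1, d=0) while leaving Pr(r=0) unchanged, which is the kind of pattern the two-part decomposition is meant to expose.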
The marginal effects (shown in Table 3-5 and Table 3-6) highlight some interesting
results. One example in the case of fresh blueberry consumption is the impact of the variable
representing whether a respondent feels price is an important factor when choosing blueberries.
When examining the influence of the price factor in the Triple Hurdle Count Data model, we see
that its dominant effect is on the probability of being a potential consumer (an increase of
0.178), and that there is no relationship between the importance of price and the probability of
being a non-participant. Thus, we conclude that when price is identified as an important factor,
the likelihood of being a potential consumer is higher, while the likelihood of being a
non-consumer is unaffected. This is as expected, as a high price might stop someone interested in
purchasing from making that purchase decision, whereas a non-consumer is expected to be in that
category for more permanent reasons (such as allergies). A similar effect is found for the taste
factor: taste influences only the probability of being a potential consumer (those who say taste
is important are less likely to consume zero), and does not influence the likelihood of being a
non-consumer. By comparison, the double-hurdle model found that both price and taste
significantly influenced consumers’ market participation decisions. However, the magnitudes of
the estimated effects on consumption intention from the double-hurdle approach are much smaller
than those from the triple-hurdle model.
Another example is consumers’ awareness of health benefits. In both the triple hurdle
and double hurdle models, consumers’ awareness of health benefits is significantly
correlated only with the participation decision. This implies that being aware of
the benefits of blueberries influences the likelihood of trying blueberries, as well as the
likelihood of consuming them more frequently, but does not affect the likelihood of being a
potential consumer (once a consumer decides to participate, the likelihood of being a potential
consumer does not depend on awareness of health benefits, but those who do participate and
consume are likely to consume more often).
As for the seasonal variables, the triple-hurdle model found significant seasonal effects
on both the market participation and consumption intention decisions, but the double-hurdle
model indicates no significant correlation between season and the market participation
decision.
In summary, the comparison shows that the triple-hurdle model provides more
detailed information for inference about market participation and consumption
decisions than the more restrictive alternative approach. The added information allows us to
distinguish factors associated primarily with market participation from those primarily
associated with consumption decisions, and from factors associated with both decisions.
Conclusion
This study examined the factors associated with consumers’ decision making, using fresh
blueberries as an example. The model presented in this paper was designed to allow three distinct
decisions by consumers: market participation (whether to participate in the market or not);
consumption participation (whether to consume during a certain time period or not); and
consumption frequency (how much to consume during the period). By distinguishing these
decisions, the market can be divided into three segments of consumers:
non-participants (those who do not intend to participate in the product market);
potential consumers (those who have positive market
participation intention, but are not willing to purchase the product during a given period); and
consumers (those who have been observed with positive consumption during a certain period).
These three segments of consumers are generated by structurally different generating processes,
and therefore a triple hurdle count data model was developed to account for each decision. The
triple hurdle count data model differs from previous models by capturing the different generating
processes of non-consumers and potential consumers. This approach facilitates improved
inference because it accounts for the fact that market participation might be driven by a different
structural process than consumption decisions. This triple hurdle count data model should be
useful in many other applications when there exist non-homogeneous decision-making
processes, and when the output is in the form of count data.
To compare the triple hurdle count data model with the models commonly used to examine
participation and consumption decisions in the market, we also estimated an extended version of
the double hurdle count data model, similar to that of Bellemare and Barrett (2006), which is
nested in our triple hurdle count data model. The likelihood ratio test, as well as the model
selection criteria (AIC and BIC), indicate that the triple hurdle count data model is
statistically preferred to the double hurdle model. Moreover, the model results highlight a
number of differences between the two approaches.
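The nested-model comparison can be reproduced from the summary statistics reported in Table 3-3. The sketch below is illustrative only: it takes the log-likelihoods, parameter counts, and sample size from that table, computes the likelihood ratio statistic and the BIC, and compares the statistic with the tabulated 1% chi-square critical value for 21 degrees of freedom (hard-coded here as an assumption of the sketch rather than computed).

```python
import math


def likelihood_ratio(loglik_restricted, loglik_unrestricted):
    """LR statistic for a restricted model nested in an unrestricted one."""
    return 2.0 * (loglik_unrestricted - loglik_restricted)


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: K * ln(N) - 2 * lnL."""
    return n_params * math.log(n_obs) - 2.0 * loglik


# Values as reported in Table 3-3.
ll_dh, k_dh = -5237.126, 39   # double hurdle (restricted)
ll_th, k_th = -5064.679, 60   # triple hurdle (unrestricted)
n_obs = 4038

lr_stat = likelihood_ratio(ll_dh, ll_th)
df = k_th - k_dh

# Tabulated 1% critical value of the chi-square distribution with 21 df.
CHI2_CRIT_1PCT_DF21 = 38.932
reject_dh = lr_stat > CHI2_CRIT_1PCT_DF21

bic_dh = bic(ll_dh, k_dh, n_obs)
bic_th = bic(ll_th, k_th, n_obs)

print(f"LR = {lr_stat:.3f} on {df} df; reject DH at 1%: {reject_dh}")
print(f"BIC(DH) = {bic_dh:.1f}, BIC(TH) = {bic_th:.1f}")
```

A lower BIC for the triple hurdle model, together with an LR statistic far above the critical value, is what "statistically preferred" refers to here.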
To demonstrate the differences between these two approaches, we applied both the double
hurdle and triple hurdle models to the case of fresh blueberry consumption. The application of
the triple hurdle model to the consumption of fresh blueberries highlights a strong relationship
between consumers’ knowledge and awareness of the health benefits of blueberries and both market
participation and consumption frequency, but no significant correlation with consumption
intention. These results suggest that advertisements and claims about the health and nutrition
benefits of fresh blueberries would be important if policymakers intend to
promote fresh produce consumption, especially to encourage more non-consumers to participate
in the market and start trying fresh produce (in this case, blueberries).
Moreover, the triple hurdle results also indicate that consumer perceptions of
blueberry characteristics, such as price and taste, are strongly correlated only with the
consumption participation and consumption frequency decisions, with no significant effect on
the market participation decision. This indicates that improvements in the taste of the product
are likely to stimulate potential consumers to consume and to encourage existing consumers to
purchase more, but will not significantly change the behavior of non-consumers. It also
indicates that discounted prices or promotions are likely to lead to increased
consumption quantity, but not to an increased number of consumers in the market.
Overall, this study demonstrates that different factors influence non-consumers and
potential consumers, which underscores the contribution of the triple hurdle
count data model. Non-consumer status is mostly related to stable characteristics
such as ethnicity, age, income level, family characteristics and consumers’ knowledge of health
information, which will not change quickly. Potential-consumer status is more closely
related to product characteristics, such as taste and price.
Figure 3-1. Zero consumption of fresh blueberry per month (created by author)
[Chart: Zero Consumption of Fresh Blueberry. Vertical axis: number of observations (0 to 250).
Horizontal axis: month. Series: non-consumer, potential-consumer, zero-consumption.]

Figure 3-2. Diagram of the data generating process of the Triple Hurdle Count Data model
(created by author)
[Diagram: General consumer sample. Stage 1 separates non-participants from market participants.
Stage 2 separates potential consumers from consumers. Stage 3 determines consumption frequency.]
Table 3-1. Variable Descriptions
Variables     Description                                    Value                        Model
Male          Percent of sample male                                            35.7%     P/C/F
College       Percent of sample with at least a four-year                       40.6%     P/C/F
              college degree
Age           Age in years (continuous in analysis)          18-24 years        13.9%     P/C/F
                                                             25-29 years        11.1%
                                                             30-34 years        10.8%
                                                             35-39 years        5.6%
                                                             40-44 years        8.2%
                                                             45-49 years        8.7%
                                                             50-54 years        11.5%
                                                             55-59 years        9.4%
                                                             60-64 years        9.2%
                                                             65 or above        11.6%
Income        Estimated household income                     $14,999 or less    11.2%     P/C/F
                                                             $15,000-$24,999    13.5%
                                                             $25,000-$34,999    14.7%
                                                             $35,000-$49,999    17.4%
                                                             $50,000-$74,999    21.0%
                                                             $75,000-$99,999    11.6%
                                                             $100,000 or above  10.6%
Hispanic      Percent Hispanic                                                  4.0%      P/C/F
Black         Percent Black/African American                                    10.1%     P/C/F
Asian         Percent Asian                                                     3.2%      P/C/F
White         Percent White                                                     82.3%     P/C/F
Otherrace     Percent other races                                               0.4%      P/C/F
Health_Aware  Percent who are aware of health benefits of                       51.9%     P/C/F
              blueberry
Budget        Food budget per week                           Less than $49      11.5%     P/C/F
                                                             $50-99             36.1%
                                                             $100-149           28.9%
                                                             $150-199           13.4%
                                                             $200-$249          5.9%
                                                             $250+              4.2%
WithChild     Percent who indicate children live in the                         34.5%     P/C/F
              household
Peop_number   Number of people in the household              1-2                55.0%     P/C/F
              (continuous in the analysis)                   3-4                34.6%
                                                             5-6                8.8%
                                                             7-8                1.3%
                                                             9 or above         0.3%
Taste         Percent who indicate taste as a reason for                        55.2%     C/F
              eating/not eating blueberries
Price         Percent who indicate price as a reason for                        55.0%     C/F
              eating/not eating blueberries
Spring        Season dummy for Spring                                           23.4%     C/F
Summer        Season dummy for Summer                                           23.9%     C/F
Fall          Season dummy for Fall                                             27.2%     C/F
Table 3-2. Estimated probabilities for fresh blueberry consumption
                       Observed    Double-hurdle    Triple-hurdle
                                   approach         approach
Non-consumers          20.06%      19.83%           21.26%
Potential consumers    26.84%      23.60%           27.03%
Consumers              53.10%      56.67%           51.71%
Table 3-3. Fresh blueberry consumption: summary statistics from double hurdle approach and
triple hurdle model
Fresh Blueberry Consumption
                    DH            TH
N                   4038          4038
K                   39            60
Loglikelihood       -5237.126     -5064.679
AIC                 10513         10189
BIC                 10798         10627
LR: TH versus DH    344.894*** (df=21)
(***) (**) and (*) indicate statistical significance at 1%, 5% and 10% levels respectively. Preferred model
with regard to each information criteria is indicated with bold.
Table 3-4. Fresh blueberry consumption: regression results
               Triple-Hurdle Count Data Model                  Double-Hurdle Count Data Model
Explanatory    Stage 1         Stage 2         Stage 3         Stage 1 & 2     Stage 3
Variables      Participation   Consumption     Consumption     Participation   Consumption
               Intention       Intention       Intensity       Intention       Intensity
Female         0.064           0.026           0.430           -0.063          0.074
               (0.057)         (0.028)         (0.134)***      (0.042)         (0.030)***
Caucasian      0.340           -0.178          -0.082          0.212           -0.132
               (0.161)**       (0.083)**       (0.307)         (0.104)**       (0.068)**
Hispanic       -1.062          0.171           -0.681          -0.241          -0.112
               (0.304)**       (0.103)         (0.450)         (0.118)**       (0.080)
Asian          -0.071          0.254           -0.700          0.080           0.115
               (0.194)         (0.133)         (0.504)         (0.150)         (0.088)
Black          0.141           0.033           -0.116          -0.219*         0.104
               (0.166)         (0.089)         (0.341)         (0.113)         (0.073)
College        0.003           -0.091          0.165           0.118           -0.024
               (0.061)         (0.026)*        (0.138)         (0.043)***      (0.030)
Health_Aware   0.812           0.021           1.234           0.734           0.210
               (0.065)***      (0.031)         (0.282)***      (0.042)***      (0.031)***
Age            -0.055          -0.032          -0.123          -0.017          -0.036
               (0.010)**       (0.005)**       (0.028)**       (0.007)***      (0.005)***
Income         0.020           0.004           0.035           0.029           0.008
               (0.011)         (0.005)         (0.032)         (0.009)***      (0.007)
Food budget    0.126           0.090           0.198           0.050           0.104
               (0.019)***      (0.011)***      (0.028)***      (0.015)***      (0.009)***
Peop_number    -0.167          -0.053          -0.110          -0.028          -0.010
               (0.051)**       (0.029)         (0.097)         (0.038)         (0.025)
With_child     0.241           -0.059          0.121           0.142           0.001
               (0.074)***      (0.038)         (0.151)         (0.058)***      (0.038)
Vegetarian     0.503           -0.231          -0.283          0.244           -0.019
               (0.181)***      (0.042)**       (0.354)         (0.125)*        (0.067)
Spring         -0.124          -0.093          -0.263          -0.103          0.055
               (0.085)         (0.035)**       (0.255)         (0.057)*        (0.045)
Summer         0.399           0.422           0.583           0.006           0.459
               (0.082)***      (0.039)***      (0.187)***      (0.056)         (0.040)***
Fall           0.115           0.206           0.026           -0.076          0.236
               (0.074)***      (0.036)**       (0.231)         (0.054)         (0.041)***
Price          -0.067          -0.443          -0.190          -0.178          -0.293
               (0.059)         (0.030)**       (0.126)**       (0.040)***      (0.028)***
Taste          -0.062          0.589           1.155           0.131           0.589
               (0.051)         (0.023)***      (0.534)***      (0.041)***      (0.036)***
Constant       0.694           0.694           -3.029          -0.032          -0.159
               (0.233)         (0.107)***      (0.753)**       (0.143)         (0.105)
Rho(1,2)       -0.794
               (0.055)***
Rho(1,3)       1.092                                           -0.112
               (0.044)***                                      (0.022)***
Rho(2,3)       0.140
               (0.033)***
# of obs       4038                                            4038
Log-Likelihood -5064.679                                       -5237.126
(***) (**) and (*) indicate statistical significance at 1%, 5% and 10% levels respectively. Preferred model
with regard to each information criteria is indicated with bold.
Table 3-5. Marginal Effects for Triple Hurdle Count Data Model
               Pr(Non-consumer)                Pr(Potential-consumer)
               Triple-hurdle   Double-hurdle   Triple-hurdle   Double-hurdle
               Pr(R=0)         Pr(R=0)         Pr(D=0,R=1)     Pr(D=0,R=1)
Female         -0.014          0.018           -0.009          -0.025***
               (0.013)         (0.012)         (0.012)         (0.009)
Caucasian      -0.075**        -0.061          0.083***        0.054
               (0.036)         (0.029)***      (0.035)         (0.021)
Hispanic       0.234***        0.069***        -0.104***       0.010
               (0.064)         (0.034)         (0.044)         (0.025)
Asian          0.016           -0.023          -0.106*         -0.024
               (0.044)         (0.043)         (0.054)         (0.028)
Black          -0.031          0.063*          -0.009          -0.047***
               (0.037)         (0.032)         (0.037)         (0.023)
College        -0.001          -0.034***       0.037***        0.016*
               (0.014)         (0.012)         (0.011)         (0.009)
Health_Aware   -0.180***       -0.210***       0.018           0.005
               (0.016)         (0.010)         (0.013)         (0.008)
Age            0.012***        0.005***        0.011***        0.008***
               (0.002)         (0.002)         (0.002)         (0.001)
Income         -0.004          0.008***        -0.001          0.000
               (0.002)         (0.003)         (0.002)         (0.002)
Food budget    -0.028***       -0.014***       -0.033***       -0.024***
               (0.004)         (0.004)         (0.005)         (0.003)
Peop_number    0.037***        0.008           0.016           0.001
               (0.011)         (0.011)         (0.012)         (0.008)
With_child     -0.053***       -0.041***       0.032*          0.011
               (0.017)         (0.017)         (0.016)         (0.012)
Vegetarian     -0.111***       -0.070*         0.110***        0.026
               (0.041)         (0.036)         (0.022)         (0.021)
Spring         0.027           0.030*          0.034***        -0.024*
               (0.019)         (0.016)         (0.015)         (0.013)
Summer         -0.088***       -0.002          -0.159***       -0.125***
               (0.017)         (0.016)         (0.017)         (0.012)
Fall           -0.047***       0.022           -0.036***       -0.071***
               (0.017)         (0.015)         (0.015)         (0.012)
Price          0.015           0.051***        0.178***        0.065***
               (0.013)         (0.011)         (0.012)         (0.009)
Taste          0.014           -0.037***       -0.241***       -0.149***
               (0.012)         (0.012)         (0.012)         (0.009)
(***) (**) and (*) indicate statistical significance at 1%, 5% and 10% levels respectively. Preferred model
with regard to each information criteria is indicated with bold.
Table 3-6. Comparison of the marginal effects for Triple Hurdle Count Data Model and Double Hurdle Count Model
                 Pr(Y=1)               Pr(Y=2)               Pr(Y=3)               Pr(Y=4)
                 TH         DH         TH         DH         TH         DH         TH         DH
Female           -0.030*    -0.010*    -0.039***  0.009      -0.041***  0.005**    0.132***   0.002
                 (0.014)    (0.006)    (0.013)    (0.005)    (0.013)    (0.002)    (0.013)    (0.005)
Caucasian        -0.020     0.033**    -0.004     -0.013     0.005      -0.009     0.011      -0.004
                 (0.036)    (0.015)    (0.031)    (0.012)    (0.030)    (0.005)    (0.030)    (0.012)
Hispanic         0.045      -0.034**   0.070*     -0.030     0.064      -0.011***  -0.309***  -0.004
                 (0.046)    (0.018)    (0.042)    (0.014)    (0.042)    (0.006)    (0.042)    (0.014)
Asian            0.127*     0.011      0.079*     0.023      0.066      0.010      -0.182***  0.004
                 (0.058)    (0.022)    (0.047)    (0.016)    (0.048)    (0.007)    (0.049)    (0.016)
Black            0.026      -0.033**   0.012      0.008      0.009      0.006      -0.007     0.003
                 (0.041)    (0.017)    (0.034)    (0.013)    (0.034)    (0.006)    (0.033)    (0.013)
College          -0.039**   0.017***   -0.021     0.001      -0.016     -0.001     0.040***   -0.001
                 (0.014)    (0.006)    (0.014)    (0.005)    (0.014)    (0.002)    (0.013)    (0.005)
Health_Aware     -0.063**   0.106***   -0.113***  0.068***   -0.117***  0.023***   0.453***   0.007**
                 (0.023)    (0.005)    (0.024)    (0.004)    (0.024)    (0.002)    (0.023)    (0.004)
Income           -0.001     0.004      -0.003     0.003      -0.003     0.001      0.013***   0.000
                 (0.003)    (0.001)    (0.003)    (0.001)    (0.003)    (0.001)    (0.003)    (0.001)
Child            -0.012     0.021**    -0.014     0.006      -0.012     0.001      0.061***   0.000
                 (0.017)    (0.009)    (0.015)    (0.007)    (0.015)    (0.003)    (0.015)    (0.006)
Vegetarian       0.004      0.036**    0.014      0.008      0.025      0.001      -0.033     -0.000
                 (0.036)    (0.019)    (0.035)    (0.012)    (0.035)    (0.005)    (0.035)    (0.013)
Spring           -0.008     -0.015*    0.018      0.005      0.024      0.003      -0.095***  0.002
                 (0.024)    (0.008)    (0.024)    (0.008)    (0.024)    (0.004)    (0.024)    (0.008)
Summer           0.076***   -0.003     -0.031     0.079**    -0.055**   0.036***   0.256***   0.014***
                 (0.025)    (0.008)    (0.024)    (0.007)    (0.025)    (0.003)    (0.024)    (0.006)
Fall             0.035***   -0.013     0.003      0.037***   -0.004     0.018***   0.049**    0.007
                 (0.024)    (0.008)    (0.023)    (0.007)    (0.023)    (0.003)    (0.023)    (0.007)
Price            -0.101***  -0.024***  -0.009     -0.058***  0.016      -0.024***  -0.098***  -0.009***
                 (0.013)    (0.006)    (0.012)    (0.005)    (0.012)    (0.002)    (0.013)    (0.004)
Taste            0.046***   0.014      -0.068     0.106      -0.100***  0.047***   0.348***   0.018***
                 (0.037)    (0.007)    (0.039)    (0.005)    (0.037)    (0.003)    (0.003)    (0.005)
Age              0.000      -0.002     0.010***   -0.007***  0.012***   -0.003***  -0.045***  -0.001
                 (0.039)    (0.001)    (0.002)    (0.001)    (0.002)    (0.001)    (0.002)    (0.001)
Food Expenditure 0.012***   0.0065     -0.013**   0.019***   -0.019***  0.009***   0.080***   0.003***
                 (0.005)    (0.002)    (0.004)    (0.002)    (0.004)    (0.001)    (0.004)    (0.001)
People_num       -0.013     -0.004     0.007      -0.003     0.010      -0.001     -0.057***  -0.000
                 (0.012)    (0.005)    (0.010)    (0.005)    (0.010)    (0.002)    (0.010)    (0.004)
TH = triple-hurdle model; DH = double-hurdle model.
(***) (**) and (*) indicate statistical significance at 1%, 5% and 10% levels respectively. Preferred model
with regard to each information criteria is indicated with bold.
CHAPTER 4
DISCUSSION
Count data has been heavily employed when analyzing consumer behavior and market
segmentation. Consumption count data is quite unique because it usually contains many
observations of zero-consumption, which provides a challenge regarding statistical modeling.
This challenge becomes more prominent when there are potentially different types of zero-
consumption generated from different mechanisms. This dissertation aims to answer the
questions of how to correctly understand, explain, and model the abundant zero consumption
observations when analyzing consumer behavior and market segmentation.
Numerous statistical models have been developed to handle count data with the issue of
zero-inflation and over-dispersion. The most commonly used statistical methods are Zero-
inflated models and Hurdle models, and each model has either a Poisson formulation or Negative
binomial formulation in terms of model specification, with the latter allowing for larger data
over-dispersion. In empirical analysis of consumption behavior, researchers have used various
statistical models based on the nature of the datasets and assumptions, trying to understand the
factors influencing consumer decisions on both market participation and consumption frequency
(Hall, 2004; Hendrix and Haggard, 2015; Almasi et al., 2016).
Although these diverse statistical methods have been extensively employed in the
empirical analysis of consumption data, evidence on the performance of these models under
different data characteristics, in particular different zero proportions, is relatively limited.
The most significant difference between Hurdle models and Zero-inflated models is that the
latter allow both structural zeros and sampling zeros, while the former assume only one type of
zero. In marketing analysis, if the data have exclusion criteria (for example, only allowing
consumers to participate in the survey collection), then it is more appropriate to employ the
Hurdle model to handle zero-inflation. However, when the proportion of observed
zero consumption is large (more than 50%), Hurdle models have relatively poor predictive
capability and large bias. In addition, if the researcher has no prior assumption about
the types of zeros, Zero-inflated models are strongly recommended for handling
zero-inflation. Regarding the distributional formulation, the negative binomial is
preferred to the Poisson when the data are over-dispersed.
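The distinction between the two families of models can be made concrete with their Poisson versions. The sketch below is an illustration under textbook definitions, not the specification estimated in this dissertation, and all parameter values are hypothetical. In the zero-inflated Poisson, a zero can come from the structural state or from the count process (a sampling zero); in the hurdle Poisson, all zeros come from a single process and positive counts follow a zero-truncated Poisson.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass function."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi_structural):
    """Zero-inflated Poisson: structural zeros plus Poisson (sampling) zeros."""
    base = (1.0 - pi_structural) * poisson_pmf(k, lam)
    return pi_structural + base if k == 0 else base


def hurdle_pmf(k, lam, p_zero):
    """Hurdle Poisson: one zero process, positives from a truncated Poisson."""
    if k == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))


# Hypothetical parameter values chosen only for illustration.
lam, pi_structural, p_zero = 1.5, 0.3, 0.45

zip_zero = zip_pmf(0, lam, pi_structural)
sampling_zero = (1.0 - pi_structural) * math.exp(-lam)

print(f"ZIP: Pr(y=0) = {zip_zero:.3f} "
      f"(structural {pi_structural:.3f} + sampling {sampling_zero:.3f})")
print(f"Hurdle: Pr(y=0) = {hurdle_pmf(0, lam, p_zero):.3f} (single zero process)")
```

A negative binomial version would replace `poisson_pmf` with a negative binomial mass function to accommodate over-dispersion; the treatment of zeros would be unchanged.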
Focusing on the analysis of market segmentation, there are three potential consumer
groups in the market: non-participants, potential consumers, and consumers. The Hurdle model
assumes that consumers pass two stages before a positive consumption
frequency is observed, which means that it only allows zero-consumption to arise in the first
stage. In other words, Hurdle models have limited ability to explain non-participants.
Zero-inflated models also assume a two-stage decision-making process for consumption behavior,
but allow zero-consumption to occur in both stages. This means Zero-inflated models can
accommodate both non-consumers and potential consumers. Thus, if there is an underlying
assumption of three different types of consumers in the market, Zero-inflated models are
more appropriate.
Despite the advantage of Zero-inflated models when both structural zeros
(non-consumers) and sampling zeros (potential consumers) exist, Zero-inflated models still fail
to differentiate between the two types of zero-consumption. This is because Zero-inflated models
still assume consumers make a two-stage decision: participation intention and consumption
frequency (though consumption frequency can be chosen at zero). This is restrictive, as it
assumes that the factors influencing potential consumers and consumers are the same, which
might not always be true. For example, it is possible that promotions exert a larger impact on
potential consumers than on consumers. Therefore, a Triple Hurdle Count Data model is
proposed in this study, which allows us to observe the three different groups of consumers.
Under the three-stage approach, the participation intention is observed in the first stage and,
conditional on the participation decision, consumers then make the subsequent
consumption intention and consumption intensity decisions. The Triple Hurdle Count Data model
provides more detailed information for classifying three types of
consumers in the market (non-consumers, potential consumers, and consumers), and explores the
structurally different factors that explain the three groups’ market participation,
consumption intention, and consumption intensity in sequence.
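The three-stage sequence can be sketched as a small simulation. The code below is a simplified illustration rather than the estimated model: it assumes independent probit decisions in the first two stages and a zero-truncated Poisson for consumption intensity, whereas the model in this study allows the stages to be correlated. The index values and the Poisson rate are hypothetical.

```python
import math
import random


def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def draw_truncated_poisson(lam, rng):
    """Draw a Poisson count conditioned on being at least one (rejection)."""
    limit = math.exp(-lam)
    while True:
        # Knuth's multiplication method for a single Poisson draw.
        k, prod = 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        if k >= 1:
            return k


def simulate_consumer(index_r, index_d, lam, rng):
    """Pass one respondent through the three hurdles in sequence."""
    if rng.random() >= norm_cdf(index_r):      # Stage 1: market participation
        return "non-participant", 0
    if rng.random() >= norm_cdf(index_d):      # Stage 2: consumption intention
        return "potential consumer", 0
    return "consumer", draw_truncated_poisson(lam, rng)  # Stage 3: intensity


rng = random.Random(0)
# Hypothetical indices and rate, identical for every simulated respondent.
sample = [simulate_consumer(0.8, 0.5, 2.0, rng) for _ in range(5000)]

counts = {}
for segment, _ in sample:
    counts[segment] = counts.get(segment, 0) + 1
for segment, n in sorted(counts.items()):
    print(f"{segment}: {n / len(sample):.3f}")
```

Both non-participants and potential consumers are recorded with zero consumption, which is why the two groups cannot be separated from the observed counts alone without a model of this kind.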
LIST OF REFERENCES
Almasi, A., Rahimiforoushani, A., Eshraghian, M. R., Mohammad, K., Pasdar, Y., Tarrahi, M.
J., ... & Jouybari, T. A. (2016). Effect of Nutritional Habits on Dental Caries in
Permanent Dentition among Schoolchildren Aged 10–12 Years: A Zero-Inflated
Generalized Poisson Regression Model Approach. Iranian journal of public
health, 45(3), 353.
Arabmazar, A., & Schmidt, P. (1982). An investigation of the robustness of the Tobit estimator
to non-normality. Econometrica: Journal of the Econometric Society, 1055-1063.
Atkins, D. C., & Gallop, R. J. (2007). Rethinking how family researchers model infrequent
outcomes: a tutorial on count regression and zero-inflated models. Journal of Family
Psychology, 21(4), 726.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.
In B. N. Petrov & F. Csaki (Eds.), Second International Symposium on Information Theory.
Akadémiai Kiadó, Budapest, Hungary.
Bandyopadhyay, D., DeSantis, S. M., Korte, J. E., & Brady, K. T. (2011). Some considerations
for excess zeros in substance abuse research. The American journal of drug and alcohol
abuse, 37(5), 376-382.
Bethell, J., Rhodes, A. E., Bondy, S. J., Lou, W. W., & Guttmann, A. (2010). Repeat self-harm:
application of hurdle models. The British Journal of Psychiatry, 196(3), 243-244.
Bezu, S., Kassie, G. T., Shiferaw, B., & Ricker-Gilbert, J. (2014). Impact of improved maize
adoption on welfare of farm households in Malawi: a panel data analysis. World
Development, 59, 120-131.
Binkley, J. K. (2006). The effect of demographic, economic, and nutrition factors on the
frequency of food away from home. Journal of consumer Affairs, 40(2), 372-391.
Dutang, C., Goulet, V., & Pigeon, M. (2008). actuar: An R package for actuarial science. Journal
of Statistical software, 25(7), 1-37.
Calsyn, D. A., Hatch-Maillette, M., Tross, S., Doyle, S. R., Crits-Christoph, P., Song, Y. S., ... &
Berns, S. B. (2009). Motivational and skills training HIV/sexually transmitted infection
sexual risk reduction groups for men. Journal of Substance Abuse Treatment, 37(2), 138-
150.
Cameron, A. C., & Trivedi, P. K. (2013). Regression analysis of count data (Vol. 53).
Cambridge university press.
Cannuscio, C. C., Tappe, K., Hillier, A., Buttenheim, A., Karpyn, A., & Glanz, K. (2013). Urban
food environments and residents’ shopping behaviors. American journal of preventive
medicine, 45(5), 606-614.
Civettini, A. J., & Hines, E. (2005). Misspecification effects in zero-inflated negative binomial
regression models: Common cases. In annual meeting of the Southern Political Science
Association, New Orleans.
Consul, P. C., & Jain, G. C. (1973). A generalization of the Poisson
distribution. Technometrics, 15(4), 791-799.
Cragg, J. G. (1971). Some statistical models for limited dependent variables with application to
the demand for durable goods. Econometrica: Journal of the Econometric Society, 829-
844.
Crowley, F., Eakins, J., & Jordan, D. (2013). Participation, expenditure and regressivity in the
Irish lottery: Evidence from Irish household budget survey 2004/2005. The Economic and
Social Review, 43(2, Summer), 199-225.
Desouhant, E., Debouzie, D., & Menu, F. (1998). Oviposition pattern of phytophagous insects:
on the importance of host population heterogeneity. Oecologia, 114(3), 382-388.
Desjardins, C. D. (2013). Evaluating the performance of two competing models of school
suspension under simulation-the zero-inflated negative binomial and the negative
binomial hurdle. University of Minnesota.
Duan, N., Manning, W. G., Morris, C. N., & Newhouse, J. P. (1983). A comparison of
alternative models for the demand for medical care. Journal of business & economic
statistics, 1(2), 115-126.
Famoye, F., & Singh, K. P. (2006). Zero-inflated generalized Poisson regression model with an
application to domestic violence data. Journal of Data Science, 4(1), 117-130.
Greene, W. H. (1994). Accounting for excess zero and sample selection in Poisson and negative
binomial regression models.
Greenwood, M., & Yule, G. U. (1920). An inquiry into the nature of frequency distributions
representative of multiple happenings with particular reference to the occurrence of
multiple attacks of disease or of repeated accidents. Journal of the Royal statistical
society, 83(2), 255-279.
Gurmu, S. (1998). Generalized hurdle count data regression models. Economics Letters, 58(3),
263-268.
Hall, D. B. (2000). Zero-inflated Poisson and binomial regression with random effects: a case
study. Biometrics, 56(4), 1030-1039.
Hall, D. B., & Berenhaut, K. S. (2002). Score tests for heterogeneity and overdispersion in
zero-inflated Poisson and binomial regression models. Canadian journal of statistics, 30(3),
415-430.
Hall, D. B., & Zhang, Z. (2004). Marginal models for zero-inflated clustered data.
Statistical Modelling, 4(3), 161-180.
Han, E., & Powell, L. M. (2013). Consumption patterns of sugar-sweetened beverages in the
United States. Journal of the Academy of Nutrition and Dietetics, 113(1), 43-53.
Harris, M. N., & Zhao, X. (2007). A zero-inflated ordered probit model, with an application to
modelling tobacco consumption. Journal of Econometrics, 141(2), 1073-1099.
Hendrix, C. S., & Haggard, S. (2015). Global food prices, regime type, and urban unrest in the
developing world. Journal of Peace Research, 52(2), 143-157.
Hu, M. C., Pavlicova, M., & Nunes, E. V. (2011). Zero-inflated and hurdle models of count data
with extra zeros: examples from an HIV-risk reduction intervention trial. The American
journal of drug and alcohol abuse, 37(5), 367-375.
Huang, H., & Chin, H. C. (2010). Modeling road traffic crashes with zero-inflation and site-
specific random effects. Statistical Methods & Applications, 19(3), 445.
Jackman, S. (2017). pscl: Classes and Methods for R Developed in the Political Science
Computational Laboratory. United States Studies Centre, University of Sydney. Sydney,
New South Wales, Australia. R package version 1.5.1.
URL https://github.com/atahk/pscl/
Jaunky, V. C., & Ramchurn, B. (2014). Consumer behaviour in the scratch card market: a
double-Hurdle approach. International Gambling Studies, 14(1), 96-114.
Jiang, Y., House, L., Tejera, C., & Percival, S. S. (2015, January). Consumption of Mushrooms:
A double-Hurdle Approach. In 2015 Annual Meeting, January 31-February 3, 2015,
Atlanta, Georgia (No. 196902). Southern Agricultural Economics Association.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1997). Discrete multivariate distributions (Vol.
165). New York: Wiley.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in
manufacturing. Technometrics, 34(1), 1-14.
Lesser, L. I., Zimmerman, F. J., & Cohen, D. A. (2013). Outdoor advertising, obesity, and soda
consumption: a cross-sectional study. BMC Public Health, 13(1), 20.
Matheson, F. I., White, H. L., Moineddin, R., Dunn, J. R., & Glazier, R. H. (2012). Drinking in
context: the influence of gender and neighbourhood deprivation on alcohol
consumption. Journal of epidemiology and community health, 66(6), e4-e4.
McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London:
Chapman & Hall.
Miller, J. M. (2007). Comparing Poisson, Hurdle, and ZIP model fit under varying degrees of
skew and zero-inflation (Doctoral dissertation, University of Florida).
Min, Y., & Agresti, A. (2005). Random effect models for repeated measures of zero-inflated
count data. Statistical modelling, 5(1), 1-19.
Morales, L. E., & Higuchi, A. (2017). Is fish worth more than meat?–How consumers’ beliefs
about health and nutrition affect their willingness to pay more for fish than meat. Food
Quality and Preference.
Morland, K., Wing, S., Roux, A. D., & Poole, C. (2002). Neighborhood characteristics
associated with the location of food stores and food service places. American journal of
preventive medicine, 22(1), 23-29.
Mullahy, J. (1986). Specification and testing of some modified count data models. Journal of
econometrics, 33(3), 341-365.
Neelon, B. H., O’Malley, A. J., & Normand, S. L. T. (2010). A Bayesian model for repeated
measures zero-inflated count data with application to outpatient psychiatric service
use. Statistical Modelling, 10(4), 421-439.
Nelder, J. A. (1989). Generalized linear models.
Nelder, J. A., & Baker, R. J. (1972). Generalized linear models. John Wiley & Sons, Inc.
R Core Team. (2012). R: A language and environment for statistical computing
[Computer software manual]. Vienna, Austria. Retrieved from
http://www.R-project.org/ (ISBN 3-900051-07-0)
Ridout, M., Hinde, J., & Demétrio, C. G. (2001). A Score Test for Testing a Zero-Inflated
Poisson Regression Model Against Zero-Inflated Negative Binomial
Alternatives. Biometrics, 57(1), 219-223.
Rose, C. E., Martin, S. W., Wannemuehler, K. A., & Plikaytis, B. D. (2006). On the use of zero-
inflated and hurdle models for modeling vaccine adverse event count data. Journal of
biopharmaceutical statistics, 16(4), 463-481.
Shonkwiler, J. S., & Shaw, W. D. (1996). Hurdle count-data models in recreation demand
analysis. Journal of Agricultural and Resource Economics, 210-219.
Slymen, D. J., Ayala, G. X., Arredondo, E. M., & Elder, J. P. (2006). A demonstration of
modeling count data with an application to physical activity. Epidemiologic
Perspectives & Innovations, 3(1), 3.
Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica:
journal of the Econometric Society, 24-36.
Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested
hypotheses. Econometrica: Journal of the Econometric Society, 307-333.
Warton, D. I. (2005). Many zeros does not mean zero inflation: comparing the goodness-of-fit
of parametric models to multivariate abundance data. Environmetrics, 16(3), 275-289.
Welsh, A. H., Cunningham, R. B., Donnelly, C. F., & Lindenmayer, D. B. (1996). Modelling the
abundance of rare species: statistical models for counts with extra zeros. Ecological
Modelling, 88(1-3), 297-308.
Wedderburn, R. W. (1974). Quasi-likelihood functions, generalized linear models, and the
Gauss-Newton method. Biometrika, 61(3), 439-447.
Wenger, S. J., & Freeman, M. C. (2008). Estimating species occurrence, abundance, and
detection probability using zero-inflated distributions. Ecology, 89(10), 2953-2959.
Yen, S. T., & Huang, C. L. (1996). Household demand for Finfish: a generalized double-Hurdle
model. Journal of agricultural and resource economics, 220-234.
Zeileis, A., Kleiber, C., & Jackman, S. (2008). Regression models for count data in R. Journal of
statistical software, 27(8), 1-25.
Zorn, C. J. (1996). Evaluating zero-inflated and hurdle Poisson specifications. Midwest Political
Science Association, 18(20), 1-16.
BIOGRAPHICAL SKETCH
Yuan Jiang received her Ph.D. degree from the Department of Food and Resource
Economics at the University of Florida in the spring of 2018. Her research is focused on
consumer behavior, agricultural marketing, and agribusiness. Prior to commencing her Ph.D.
study at the University of Florida, she received an M.S. degree in agricultural economics and an
M.S. degree in statistics from the University of Florida, and a B.E. degree in economics from
Shandong University, China.