
ANALYSIS OF CONSUMPTION BEHAVIORS AND MARKET STRUCTURE WITH

EXCESS ZEROS AND OVER-DISPERSION

By

YUAN JIANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL

OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2018

© 2018 Yuan Jiang

To my Mom and Dad


ACKNOWLEDGMENTS

I would like to take this opportunity to express my gratitude to those who helped me

complete my dissertation. I thank my supervisor, Dr. Lisa House, whose profound

knowledge of the field provided me with valuable advice and sparked my thinking. I am

grateful for her tremendous help, support, patience and encouragement throughout this

wonderful journey of challenge and fulfillment. All of the achievements during my study would

not have been possible without her guidance and advice.

I would also like to express my great appreciation to all the professors on my committee. Dr.

Zhifeng Gao has offered me considerable help and guidance in various aspects, including the

simulation study design, data analysis, and cheerful encouragement. I also thank Dr. Brandon

McFadden, Dr. Hyeyoung Kim and Dr. Zhihua Su for their suggestions and help along the

completion of this dissertation.

I would like to thank the Food and Resource Economics Department for providing me the

chance to study and obtain valuable professional training. Without its support, nothing would be

possible. In addition, I would like to express my heartfelt appreciation to my dear friends and fellow

graduate students for their support and encouragement.

From the bottom of my heart, I want to express my gratitude to all of my family for their

love and support, especially my parents.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 COMPARISON OF THE PERFORMANCE OF COUNT DATA MODELS UNDER DIFFERENT ZERO-INFLATION SCENARIOS USING SIMULATION STUDIES
    Background
    Literature Review
        Count Data and Generalized Linear Model
        Poisson Regression Model and Applications
        Problems with the Poisson Regression
        Negative Binomial Regression Models and Applications
        Zero-inflated Models and Applications
        Hurdle Model and Applications
        Comparison of the Models
        Gaps and Shortcomings
    Method
        Research Questions
        Monte Carlo Simulation
        Simulation Study Design
        Model Evaluation of the Simulation Studies
        Data Generating Mechanism
    Results
        Pseudo-Population Zero-inflated Poisson Model
            Model fit
            Relative bias of E(Y|X)
            Abilities of capturing zero observation
        Pseudo-Population Hurdle Poisson Model
            Model fit
            Relative bias of E(Y|X)
            Abilities of capturing zero observation
        Pseudo-Population Zero-inflated Negative Binomial Model
            Model fit
            Relative bias of E(Y|X)
            Abilities of capturing zero observation
        Pseudo-Population Hurdle Negative Binomial Model
            Model fit
            Relative bias of E(Y|X)
            Abilities of capturing zero observation
        Compare the Mis-specified Models across All the True Models
    Conclusion
        Model Performance Given Four Pseudo-Populations
            True model - Zero-inflated Poisson Model
            True model - Hurdle Poisson Model
            True model - Zero-inflated Negative Binomial Model
            True model - Hurdle Negative Binomial Model
        Comparison between Different Models
            Poisson formulation versus Negative-Binomial formulation
            Zero-inflated models versus Hurdle models
        Capability of Predicting Zero Observations
        Future Work and Limitations

3 A TRIPLE HURDLE COUNT DATA MODEL OF MARKET PARTICIPATION AND CONSUMPTION
    Background
    The Econometric Modeling of Count Data for Consumption Behavior
    Motivation
    Conceptual Framework
    Econometric Framework
        Triple Hurdle Count Data Model with Independent Stages
        Triple Hurdle Count Data Model with Interdependence
        Marginal Effects and Interpreting Results
        Comparing Triple Hurdle Count Data Model and the Double Hurdle Models
    Variables and Data
        Data Set
        Variables
    Results
        Regression results of Triple Hurdle Count Data Model Results
        Marginal Effects of the Triple Hurdle Count Data Model
    Conclusion

4 DISCUSSION

LIST OF REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

Table

2-1 Convergence rate, true model is ZIP
2-2 Mean Log-likelihood, true model is ZIP
2-3 Mean AIC, true model is ZIP
2-4 Relative Bias for E(Y|X), true model is ZIP
2-5 Observed and predicted zero observations, true model is ZIP
2-6 Convergence rate, true model is HP
2-7 Mean Log-likelihood, true model is HP
2-8 Mean AIC, true model is HP
2-9 Relative Bias for E(Y|X), true model is HP
2-10 Observed and predicted zero observations, true model is HP
2-11 Convergence rate, true model is ZINB
2-12 Mean Log-likelihood, true model is ZINB (dispersion=0.5)
2-13 Mean AIC, true model is ZINB (dispersion=0.5)
2-14 Mean Log-likelihood, true model is ZINB (dispersion=1)
2-15 Mean AIC, true model is ZINB (dispersion=1)
2-16 Mean Log-likelihood, true model is ZINB (dispersion=2)
2-18 Mean Log-likelihood, true model is ZINB (dispersion=4)
2-20 Mean AIC of ZINB model, true model is ZINB
2-21 Relative Bias for E(Y|X), true model is ZINB (dispersion=0.5)
2-22 Relative Bias for E(Y|X), true model is ZINB (dispersion=1)
2-25 Relative Bias for E(Y|X) of ZINB model, true model is ZINB
2-26 Observed and predicted zero observations, true model is ZINB (dispersion=0.5)
2-27 Observed and predicted zero observations, true model is ZINB (dispersion=1)
2-30 Convergence rate, true model is HNB
2-31 Mean Log-likelihood, true model is HNB (dispersion=0.5)
2-32 Mean AIC, true model is HNB (dispersion=0.5)
2-33 Mean Log-likelihood, true model is HNB (dispersion=1)
2-34 Mean AIC, true model is HNB (dispersion=1)
2-39 Mean AIC of HNB, true model is HNB
2-40 Relative Bias for E(Y|X), true model is HNB (dispersion=0.5)
2-41 Relative Bias for E(Y|X), true model is HNB (dispersion=1)
2-44 Relative Bias for E(Y|X) of HNB model, true model is HNB
2-45 Observed and predicted zero observations, true model is HNB (dispersion=0.5)
2-46 Observed and predicted zero observations, true model is HNB (dispersion=1)
2-49 Average AIC statistics across all the models
2-50 Average Relative Bias across all the models
3-1 Variable Descriptions
3-2 Estimated probabilities for fresh blueberry consumption
3-3 Fresh blueberry consumption: summary statistics from double hurdle approach and triple hurdle model
3-4 Fresh blueberry consumption: regression results
3-5 Marginal Effects for Triple Hurdle Count Data Model
3-6 Comparison of the marginal effects for Triple Hurdle Count Data Model and Double Hurdle Count Model


LIST OF FIGURES

Figure

3-1 Zero consumption of fresh blueberry per month
3-2 Diagram of the data generating process of the Triple Hurdle Count Data model


Abstract of Dissertation Presented to the Graduate School

of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Doctor of Philosophy

ANALYSIS OF CONSUMPTION BEHAVIORS AND MARKET STRUCTURE WITH

EXCESS ZEROS AND OVER-DISPERSION

By

Yuan Jiang

May 2018

Chair: Lisa House

Major: Food and Resource Economics

There is a long history of interest in modeling consumer behavior and predicting market

structure in agricultural economics. When analyzing consumption behavior at the individual

level, the data is frequently formatted as count data, especially when measuring consumption

frequency or intensity during a given period. However, this type of data is often characterized by

an excess of zero observations (zero-inflation) and a heavy right tail (over-dispersion). These factors

influence whether or not the Poisson regression model, typically used with count data, can be

appropriately applied.

To solve the shortcomings of the Poisson regression model, a number of modified

Poisson regression models have been developed. The most popular approaches are the Zero-

inflated/modified model and the Hurdle model, which differ based on different assumptions

about the sources of zero observations. The first part of this dissertation aims to review and

evaluate the performance of these modified models given different levels of zero-inflation and

over-dispersion using a simulation study regarding model fit and prediction capability.

Furthermore, special attention will be given to the comparison of the ability of the models to

predict the correct latent classes, as well as understanding the consequences of model


misspecification when the data-generating mechanism is improperly specified. Based on this

analysis, the Zero-inflated models are preferred to the Hurdle models, especially regarding model

fit and prediction capability. If the assumption about the source of the zeros is central and interest is focused on the

accuracy of zero predictions, then the Hurdle models should still be considered. Moreover,

this analysis verified that models with Negative Binomial formulations are preferred to

models with Poisson formulations when the data are over-dispersed.

The second part of this dissertation proposes a new approach, a Triple Hurdle Count Data

model, to analyze consumption behavior. This new model extends the Zero-inflated models and

Hurdle Count Models to market participation modeling. It will allow us to identify consumer

participation, desire, and acquisition separately, and to explore the appropriate structurally

different reasons explaining consumers’ decisions on market participation, consumption

intention, and consumption intensity in sequence. The new model is applied to a consumer

choice problem of blueberry consumption to discuss the difference in insights gained by

employing the Triple Hurdle Count Data model compared to the Double Hurdle approach.


CHAPTER 1

INTRODUCTION

There is a long history of interest in modeling consumer behavior and predicting market

segmentation in economics, in particular, understanding consumers’ preferences and

consumption. When analyzing consumer behavior on the individual level, especially when using

a survey to collect the primary data, consumption is frequently recorded in the form of count

data. Count data usually occurs when measuring consumption frequency or intensity during a

certain period. This dissertation focuses on how to model and understand

consumption behavior using count data, and how to better predict market segmentation using

appropriate econometric methods.

To analyze consumer behavior, and to predict the market segmentation, it is very

important to explore the factors that influence consumption behavior, which includes decisions

on both market participation and consumption. However, data on how often

consumers purchase in a given period present an interesting statistical challenge, as

there are many observations recorded as zero. For example, in a survey eliciting information on

blueberry consumption, answers to the question “How often did you consume blueberries last

month?” will include many respondents answering none (zero), and the number of zero

observations will vary depending on the time of the year. Typically, this type of data will be

characterized by an excess of zero observations (or zero-inflation) that influences modeling and

interpretation of the data.

Regarding consumption decisions, these abundant zero observations can arise for

three different reasons. First, it is possible that some individuals have a non-positive desire for

the product. In other words, these individuals will not be consumers of the product because of

some permanent reason (such as an allergy). Second, while some individuals do desire to


purchase a product, they do not consume for some temporary reason (for example, the current

price of the product is greater than the upper bound of their willingness, or ability, to pay at the

given income level). In this case, the zero consumption is the corner solution for the individual’s

utility-maximizing decision. Third, some individuals have positive market participation desire,

but zero consumption could be observed due to the infrequency of purchase (for example, they

purchase, but not during the period surveyed). This situation could often happen in the case of

durable goods.

Regarding market segmentation, individuals who choose zero consumption because of

the first reason are considered non-consumers, and these corresponding zero observations are

called structural zeros. The individuals who have positive market participation desire but were

observed consuming zero units because of the second and third reasons are considered

potential consumers, and the corresponding zero observations are called sampling zeros. Thus,

the particular interpretations given to these zero consumption observations will have a crucial

bearing on the estimation techniques, and the interpretation of results, especially for market

segmentation.

Considering the appropriate statistical modeling, when the dependent variable is count

data, one of the most commonly used statistical methods is the Poisson regression. Poisson

regression is commonly used by economists to model the number of events, like the frequency of

consumption. However, the Poisson model fails to provide an adequate fit when there exists the

problem of “excessive zeros.” When “excessive zeros” exist, the data mean is pulled towards

zero, causing a violation of the assumption of mean-variance equality which is part of the

Poisson regression model.
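
The violation of mean-variance equality can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that a count is a Poisson variable with mean `lam` mixed with an extra point mass `pi` at zero. The function name and both parameters are hypothetical and are not taken from the models estimated later in this dissertation.

```python
def zero_inflated_moments(pi, lam):
    """Mean and variance of Y = B * X, where B ~ Bernoulli(1 - pi) and
    X ~ Poisson(lam) are independent (an illustrative assumption)."""
    if not 0.0 <= pi < 1.0 or lam <= 0.0:
        raise ValueError("need 0 <= pi < 1 and lam > 0")
    mean = (1.0 - pi) * lam
    # Var(Y) = E[Y^2] - E[Y]^2 = (1 - pi) * lam * (1 + pi * lam)
    variance = (1.0 - pi) * lam * (1.0 + pi * lam)
    return mean, variance


# The variance-to-mean ratio is 1 + pi * lam, so it equals 1 only when pi = 0.
for pi in (0.0, 0.2, 0.5):
    mean, variance = zero_inflated_moments(pi, lam=3.0)
    print(f"pi={pi:.1f}  mean={mean:.2f}  variance={variance:.2f}")
```

Under this assumption, any positive share of extra zeros pushes the variance above the mean, which is the over-dispersion that a plain Poisson regression cannot accommodate.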


To address this potential shortcoming of the Poisson regression model, a number of

modified Poisson regression models have been developed. The Zero-inflated Poisson (ZIP)

model was first proposed by Lambert (1992) and is commonly used. Following this, the Zero-

inflated Negative Binomial was developed to further handle the problem of data over-dispersion,

as well as to address the issue of inequality of the mean and variance (Consul and Jain, 1973;

Famoye and Singh, 2006).

The Zero-inflated count data models assume that the zero observations come from two

distinct sources: “sampling zeros” and “structural zeros.” When applied to consumption

analysis, zero-inflated count data models allow for zero-consumption to come from both cases

where the consumer is a genuine non-participant (structural zero), and when zero consumption is

the corner solution of a standard consumer demand problem (sampling zero).
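
The two sources of zeros can be made concrete with the standard zero-inflated Poisson probability mass function. This is a minimal sketch that assumes a constant structural-zero probability `pi` and Poisson mean `lam` with no covariates; the function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, pi, lam):
    """P(Y = k) under a zero-inflated Poisson with structural-zero probability pi."""
    if k == 0:
        # A zero is either structural (pi) or a sampling zero from the Poisson part.
        return pi + (1.0 - pi) * math.exp(-lam)
    return (1.0 - pi) * poisson_pmf(k, lam)


def structural_share_of_zeros(pi, lam):
    """Fraction of all zeros that are structural rather than sampling zeros."""
    return pi / zip_pmf(0, pi, lam)


print(zip_pmf(0, pi=0.3, lam=2.0))
print(structural_share_of_zeros(pi=0.3, lam=2.0))
```

In a consumption setting, `structural_share_of_zeros` would correspond to the share of zero observations attributed to genuine non-participants, with the remainder attributed to potential consumers.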

Different from the Zero-inflated count data model, the Hurdle model proposed by

Mullahy (1986) assumes that all the zeros are structural zeros. When applied to consumption

analysis, Hurdle models assume that individuals need to pass two stages before being observed

with a positive consumption intensity: a participation decision and a consumption decision.

Furthermore, Hurdle models assume that the participation stage is dominant. Thus, all zero observations

are assumed to be generated in the first stage (the decision on whether to consume), and in the

second stage, the consumption count is truncated at zero. As a result, all zeros are treated as

structural zeros, regardless of whether they are in fact structural or sampling zeros. Hurdle Negative Binomial models

were developed later (Gurmu, 1998) to better handle the issue of

over-dispersion.
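
The two-stage structure can be sketched with the standard hurdle Poisson probability mass function. As an illustrative assumption, the first-stage zero probability `p_zero` and the Poisson parameter `lam` are constants with no covariates; the function name is hypothetical.

```python
import math


def hurdle_poisson_pmf(k, p_zero, lam):
    """P(Y = k) under a hurdle Poisson: a binary hurdle, then a zero-truncated Poisson."""
    if k == 0:
        # Every zero is generated by the first (participation) stage.
        return p_zero
    poisson_k = math.exp(-lam) * lam ** k / math.factorial(k)
    # Positive counts follow a Poisson renormalized to exclude zero.
    return (1.0 - p_zero) * poisson_k / (1.0 - math.exp(-lam))


print([round(hurdle_poisson_pmf(k, p_zero=0.4, lam=2.0), 4) for k in range(5)])
```

Unlike the zero-inflated form, the count stage here contributes nothing to the probability of a zero, which is why a hurdle model cannot separate sampling zeros from structural zeros.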

The choice between Hurdle models and Zero-inflated models should be based on whether

the researcher believes that all zero observations come from the structural zero group or that at


least some of the zeros are sampling (random) zeros. In consumption analysis, the

choice between Hurdle models and Zero-inflated models should mainly depend on whether or

not the researchers believe that potential consumers (those who would consume under the

right circumstances) exist in the market. However, if there is no clear distinction between the

two groups of zeros in the market, then the choice between these two models would depend on

the model performance, which includes both model fit and predictive capabilities.

There is relatively little literature comparing and evaluating the performance of these

count data models, and results from the existing research are conflicting with respect to which

model is superior. For example, Green (1994) found that the negative binomial model was

superior to the ZIP model, and the ZIP model was superior to the Poisson model. Conversely,

Lambert (1992) argued that the ZIP model has superior fit compared to the Negative Binomial

model. Needlon et al. (2010) found that the ZIP model fits better than the Poisson and Hurdle

models, while Welsh et al. (1996) found the Hurdle and ZIP models to be equal. According to Miller

(2007), the discrepant results of the model comparisons may arise because the datasets

used in the analyses differ considerably in the proportion of zeros, with some studies using data

with as few as 20% zeros and others using data with as many as 90% zeros.

Additionally, there have been fewer studies comparing the different count data models

with zero-inflation and over-dispersion using simulated data. Lambert (1992) proposed the zero-

inflated Poisson model and evaluated its performance using simulation studies. Miller (2007)

compared the Poisson, hurdle, and zero-inflated models under varying zero-inflation levels; and

Desjardins (2013) evaluated the performance of the zero-inflated negative binomial and negative

binomial hurdle models under simulation. However, most of these previous studies focused mainly

on model fit and parameter recovery and rarely considered the models’ capacity to predict

the different zero types.

Direct comparisons of Zero-inflated and Hurdle models are even rarer in

previous research. A critical assumption of Zero-inflated models is that there

exist both structural and sampling zeros, yet no simulation studies have analyzed whether the Zero-

inflated models can effectively distinguish the structural zeros from the sampling zeros. If

the Zero-inflated models cannot differentiate the two types of zeros, then the utility of using the

Zero-inflated models over Hurdle models is limited. Previously, only Desjardins (2013) tried to

compare the Zero-inflated Negative Binomial model with the Hurdle Negative Binomial model

given a fixed level of zero proportion. However, no previous studies have examined the Zero-

inflated models versus Hurdle models with both Poisson and Negative Binomial formulations

given different levels of zero proportion and over-dispersion. Therefore, there is no

guidance from prior research on how different levels of zero proportion and over-

dispersion affect the model fit of Zero-inflated and Hurdle models or their capability

to predict different types of zero observations. For the analysis of consumption behavior, how the

models predict different types of consumers (i.e., different reasons for zeros) is very

important for predicting and making recommendations related to market segmentation. It

becomes even more critical when choosing between Hurdle models, which assume no sampling

zeros, and Zero-inflated models, which allow both structural and sampling zeros.

To my knowledge, there have been no simulation studies that compare Zero-inflated

models and Hurdle models with both Poisson and Negative Binomial formulations,

and no prior studies that investigate how zero proportion and over-dispersion affect model

fit in these models. Furthermore, given the importance of differentiating the types of


zeros, particular attention should be given to the comparison of different models’ capabilities to

predict the correct latent classes, which has not yet been explored by the previous research. In the

case of consumption analysis, we will compare the models’ capabilities of predicting market

segmentation.

Additionally, in the analysis of consumption behavior, all count data models listed above

have been employed to analyze consumers’ consumption intention and intensity. Unfortunately,

none of these models can specify the consumers’ actual consumption desire in the model

specifications. Although the Hurdle models allow consumption behavior to be divided into two

stages, and the Zero-inflated models assume the zero observations could either be non-

consumers or potential consumers, they cannot differentiate the two types of zero

observations in the model specification; thus, they fail to separate the two types of zero

observations according to consumers’ true consumption desire. Moreover, both the Zero-inflated

and Hurdle models are designed on a two-stage structure. Although they assume that the factors

influencing non-consumers are different from others, they still impose a strong restriction that

the same set of factors influences potential consumers and consumers with positive consumption

intensity, which may not always be true.

To overcome the shortcomings mentioned above, it is important to develop a new count

data model that can test for and treat the two types of zero observations separately in the

analysis of consumption behavior and allow three different sets of factors to influence the three

different groups: non-participants, potential consumers, and consumers. Thus, a Triple Hurdle

count data model is developed, which allows for the identification of non-consumers (in the first

hurdle), potential consumers (in the second hurdle), and consumption intensity (in the third

stage). This new model is better suited to market segmentation analysis, and it also allows

deeper insight into the characteristics of different types of consumers. In order to generalize the

findings of this research, the new model will be compared with the double hurdle approach by

applying both models to the analysis of the fresh blueberry consumption.
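
The three-stage idea can be illustrated with a small simulation. This is only a sketch under strong assumptions that are not the specification developed in Chapter 3: the stages are independent, the probabilities `p_participate` and `p_acquire` and the rate `lam` are constants with no covariates, and positive counts follow a zero-truncated Poisson. All names are hypothetical.

```python
import math
import random


def draw_poisson(rng, lam):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def draw_truncated_poisson(rng, lam):
    """Draw from a zero-truncated Poisson by rejecting zeros."""
    while True:
        k = draw_poisson(rng, lam)
        if k > 0:
            return k


def simulate_triple_hurdle(n, p_participate, p_acquire, lam, seed=0):
    """Return n (segment, count) pairs from an assumed three-stage process."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        if rng.random() >= p_participate:
            sample.append(("non-consumer", 0))        # fails the first hurdle
        elif rng.random() >= p_acquire:
            sample.append(("potential consumer", 0))  # fails the second hurdle
        else:
            sample.append(("consumer", draw_truncated_poisson(rng, lam)))
    return sample


sample = simulate_triple_hurdle(1000, p_participate=0.7, p_acquire=0.6, lam=2.0)
for segment in ("non-consumer", "potential consumer", "consumer"):
    print(segment, sum(1 for s, _ in sample if s == segment))
```

In simulated data of this kind the two groups of zeros are labeled by construction, whereas in survey data only the count is observed, which is why the segments must be identified through the model.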

The structure of the dissertation is as follows. Chapter 2 reviews and compares

the performance of the six most popular methods of modeling count data, especially when the

data exhibit zero-inflation and over-dispersion. These include the basic Poisson

model and several remedies for its shortcomings (including negative

binomial models, zero-inflated Poisson model, zero-inflated negative binomial model, hurdle

Poisson model and hurdle negative binomial model). For each model, we will discuss its

characteristics and applications in the field of consumption behavior. This chapter will also

include an evaluation of the six models using simulation studies under different levels of

dispersion and zero proportions, and will compare the models’ performance regarding model fit,

predictive capability, and particularly the capability to predict zero observations.

In Chapter 3, a new approach, a Triple Hurdle Count Data model, is proposed to analyze

consumption behavior, which assumes a three-stage decision making process. By differentiating

three different types of consumers in the model specification, this model allows three different

generating processes correlated with the three different types of consumers. This model will be

applied to a consumer choice problem of blueberry consumption together with the Double

Hurdle approach, to discuss the difference in insights gained by employing this new model.

Finally, Chapter 4 concludes with a discussion of the results and their implications for
consumption behavior analysis when the data are counts with zero-inflation and over-dispersion.
It also discusses appropriate statistical methods for understanding market structure and
segmentation.

CHAPTER 2

COMPARISON OF THE PERFORMANCE OF COUNT DATA MODELS UNDER

DIFFERENT ZERO-INFLATION SCENARIOS USING SIMULATION STUDIES

Background

In statistical modeling, when the dependent variable is formatted as count data, the most

popular regression technique is the Poisson regression model. However, the Poisson model fails

to provide an adequate fit when there exists the problem of zero-inflation. Thus, the Poisson
model has been modified to address this issue. The most popular modifications are the
Zero-inflated/Modified Poisson and Hurdle Poisson models. Further, there are Negative Binomial
variations of these models that address possible over-dispersion.

The Zero-inflated Poisson (ZIP) model was proposed by Lambert in 1992. Following

this, a number of related models have been proposed, including the Poisson-Negative Binomial

and modified Poisson to address the inequality of the mean and variance (as equality is assumed

for the Poisson distribution) (Famoye and Singh, 2006). The Zero-inflated count data model
assumes that the zero observations come from two distinct sources and are identified separately
as “sampling zeros” and “structural zeros.” An example of the different types of zeros can be
seen when analyzing consumption of food products, where zero consumption could be recorded
when the consumer is a genuine non-participant (structural zero), or when the zero consumption
is the corner solution of a standard consumer demand problem (sampling zero).

Different from the Zero-inflated count data model, the Hurdle models proposed by

Mullahy (1986) assume that all zeros are sampling zeros. When applied to consumption analysis,

Hurdle models assume individuals need to pass two stages before being observed with a positive

level of consumption: a participation decision and a consumption decision. Furthermore, the

Hurdle models assume participation dominance, which indicates that all zero observations are

assumed generated in the first stage (whether or not to consume), and in the second stage,

consumption behavior is truncated at zero.

Thus, the choice between Hurdle models and Zero-inflated models is typically based on
whether the researcher believes that all the zero observations come from a single source or that
the zeros are a mixture of structural and sampling zeros.

There has been relatively little literature that has compared or evaluated the performances

of these count data models, and the results of the studies that have examined this subject vary in

their conclusions. For example, Green (1994) found that the Negative Binomial model was
superior to the ZIP model, and the ZIP model was superior to the Poisson model in terms of
model fit. However, Lambert (1992) argued that the ZIP model was superior to the Negative
Binomial model regarding prediction error. Needlon et al. (2010) found that the ZIP model
fits better than the Poisson and Hurdle models, while Welsh et al. (1996) found the Hurdle
and ZIP models to perform equally well.

According to Miller (2007), the discrepant results of these model comparisons might arise
because the datasets employed differ considerably in the proportion of zeros, with some studies
using data with 20% zeros and others using data with as much as 90% zeros. In addition to
differences in the proportion of zeros, the datasets also differ with regard to over-dispersion.
Desjardins (2013) evaluates the performance of Zero-inflated Negative Binomial (ZINB) and
Negative Binomial Hurdle (NBH) models given different levels of dispersion.

Most of these comparisons are based on empirical datasets, and very few studies have
used simulated data to test and compare model performance. Lambert (1992) proposed the
Zero-inflated Poisson model and evaluated its performance using simulation studies. Miller
(2007) compares the Poisson, Hurdle, and Zero-inflated models under varying zero-inflation
levels, and Desjardins (2013) evaluates the performance of Zero-inflated Negative Binomial
(ZINB) and Negative Binomial Hurdle (NBH) models using simulation under a given proportion
of zeros.

However, most of these previous studies that compare the models’ performance mainly

focus on model fit and parameter recovery. Although this is important, how well the models

predict different categories of zeros can also be of importance, especially for consumption

studies (to predict different types of consumers). When choosing between the Hurdle models,
which assume no structural zeros, and the Zero-inflated models, which allow both structural and
sampling zeros, it is useful to know how well each predicts market segmentation.

To the best of our knowledge, no simulation studies have compared Zero-inflated and Hurdle
models under both Poisson and Negative Binomial distributions. There are also no prior studies
investigating how zero proportions and levels of dispersion might affect estimation and model
fit in the Hurdle and Zero-inflated models. Furthermore, special attention can be given to
comparing the models’ capabilities to predict the correct latent classes. In the case of
consumption analysis, we will compare the models’ capabilities of predicting market structure
and segmentation. The Zero-inflated models assume that there are two different types of zeros,
and the Hurdle models assume that there is only one type of zero. If the Zero-inflated models
cannot differentiate the different types of zeros, their usefulness will be limited. Thus, it is
critical to test whether the Zero-inflated models reliably predict the different types of zeros,
especially given various proportions of zeros.

In this study, two research questions will be examined. First, under different simulation
conditions, which include the true model distribution, zero proportion, and dispersion rate,
how do these six count data models (the Poisson, Zero-inflated Poisson, and Hurdle Poisson
models, and their Negative Binomial variations) perform regarding model fit? Second, we
explore the consequences of misspecifying the distributions. In particular, we focus on the
comparison between Zero-inflated models and Hurdle models, evaluate the proportion of
correctly identified structural zeros for the Zero-inflated models, and test the consequences,
if any, of misspecifying the latent classes for the zeros given different levels of structural
zeros.

Literature Review

Count Data and Generalized Linear Model

Count data occurs very frequently in many different fields of research, especially in the

field of social science. Count data can be used to represent the number of times that an event

occurs under a certain condition or during a certain time; for example, the number of times that

consumers purchase a certain good during a certain period would be an event count. As such, the
response values take the form of discrete non-negative integers. Hence, count data is the
“…realization of a nonnegative integer-valued random variable” (Cameron and Trivedi, 1998).

When analyzing count data, there is an assumption that the number of events is

independently identically distributed with a discrete probability distribution. The most common

probability distributions used to describe count data are the Poisson and Negative Binomial

distributions. The Poisson distribution was derived as a limiting case of the binomial
distribution and is characterized by mean-variance equality. The Negative Binomial distribution
was derived by Greenwood and Yule (1920) and is used as an alternative to the Poisson
distribution when the assumption of mean-variance equality is violated.

As for regression models, the classic linear regression model is not suitable for count data

analysis, since the assumption of normality is violated. Thus, generalized linear models, which

allow the analysis of data when the assumptions of linearity and normality are no longer met, are

employed.

The Generalized Linear Model (GLM) was first described by Nelder and Wedderburn

(1972) and has been further developed and explained by McCullagh and Nelder (1989). Instead
of modeling the mean as a linear function of the covariates, as in classic linear regression, it
allows other possibilities. All GLMs are specified with three components: a random component,
which specifies the distribution of the output variable; a systematic component, which specifies
the covariates in a linear form; and a link function, which connects the random component to
the systematic component. If the distribution of the output variable is normal, then the classic
OLS regression is appropriate. Besides the normal distribution, other distributions, such as the
binomial, Poisson, and Negative Binomial distributions, can be used.

Regarding the three components of the GLM, it is necessary to clarify the equations. The
systematic component is a linear form of the covariates, as in Equation 2-1:

$$\eta_i = x_i'\beta \tag{2-1}$$

where $x_i$ is the vector of covariates for observation $i$, and $\beta$ represents the
corresponding unknown parameters.

A link function connects the mean value of the output variable $Y$ to the linear predictor
$\eta$ through a function $g(\cdot)$. Thus, the GLM is expressed as Equation 2-2:

$$g(\mu_i) = \eta_i = x_i'\beta \tag{2-2}$$

Poisson Regression Model and Applications

The Poisson regression model is the most popular method for analyzing count data. It is a
specific form of the GLM in which the output variable $Y$ follows a Poisson distribution, with
the link function $g(\mu) = \log(\mu)$.

Thus, the probability of observing $y_i$ can be written as Equation 2-3:

$$f(y_i; \mu_i) = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}, \qquad y_i = 0, 1, 2, 3, \ldots \tag{2-3}$$

where $\mu_i$ is the parameter of the Poisson distribution, which is both the mean and the
variance of $y_i$ for the $i$th observation. Given the log link function and the linear
predictor $\eta_i$, $g(\mu_i) = \log(\mu_i) = x_i'\beta$, and thus $\mu_i = \exp(x_i'\beta)$.
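As a minimal sketch of Equation 2-3 and the log link, the Python snippet below evaluates the Poisson probability and the log-likelihood of a Poisson regression. The covariate rows, counts, and coefficient values are hypothetical and chosen only for illustration; they are not taken from the data analyzed in this dissertation.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu (Equation 2-3)."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def poisson_mean(x, beta):
    """Inverse of the log link: mu_i = exp(x_i' beta)."""
    return math.exp(sum(xj * bj for xj, bj in zip(x, beta)))

def poisson_loglik(X, y, beta):
    """Sum of log f(y_i; mu_i) over all observations."""
    return sum(math.log(poisson_pmf(yi, poisson_mean(xi, beta)))
               for xi, yi in zip(X, y))

# Hypothetical data: an intercept plus one covariate, and made-up coefficients.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
y = [1, 2, 4]
beta = (0.1, 0.6)
print(poisson_loglik(X, y, beta))
```

Working on the log scale with `math.lgamma` avoids overflow in the factorial for large counts.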

The Poisson regression has been widely used when analyzing count data. In the field of

consumption behavior, because the consumption frequency or purchase intensity is often

described as count data, Poisson regression models have been used to analyze consumers’

behavior. For example, Morland et al. (2002) used Poisson regression when analyzing
consumers’ access to healthy food choices in relation to the distribution of food stores and
food service places. Binkley (2006) employed Poisson regression to explore the effect of
demographic, economic, and nutrition factors on the consumption frequency of food away from
home. Cannuscio et al. (2013) used Poisson regression to analyze the correlation between the
food environment and residents’ shopping behaviors.

Problems with the Poisson Regression

Although Poisson regression models are popular when analyzing count data, the model
might not always be the best fit because the Poisson distribution requires that the mean equal
the variance, that is, $\mu_i = E(Y_i) = Var(Y_i)$. The assumption of mean-variance equality is
very restrictive and easily violated. When the observed variance is greater (or less) than the
observed mean, the Poisson distribution no longer describes the data well, and the data are said
to exhibit over-dispersion (under-dispersion). Taking consumption behavior as an example, for
some goods like tobacco, many people never consume because they are non-smokers, yet many
others consume extremely large quantities per week (heavy smokers). In this case, the data
might not meet the assumption of mean-variance equality, and the Poisson model would not be
appropriate.

A special case of over-dispersion happens when there are excessive zeros in the data.

When there are abundant zero observations, the mean of the data will be closer to zero, resulting

in the violation of mean-variance equality assumption. Thus, ignoring the issue of excessive

zeros will cause biased parameter estimates and poor model fit. Using tobacco consumption as

an example again, whenever the question of consumption frequency is asked, there would be

many people who answer zero, since they are non-smokers. The same thing happens for the

consumption of food, where consumers might choose not to consume in a given period, or not

consume for reasons such as allergies or personal beliefs.
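To illustrate how excess zeros break mean-variance equality, the sketch below computes a simple variance-to-mean ratio (a value near 1 is what a Poisson variable would produce) before and after padding a sample with zeros. The two count lists are hypothetical and serve only as an illustration.

```python
def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 under a Poisson model."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean

# Hypothetical counts whose variance is close to their mean.
base = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]
# The same counts with ten extra zeros appended (for example, non-smokers).
inflated = base + [0] * 10

print(dispersion_index(base))
print(dispersion_index(inflated))
```

Adding zeros lowers the mean faster than it lowers the variance, so the ratio rises above 1 and the sample looks over-dispersed.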

When considering the source of the excessive zeros, some research argues that the zeros

might arise from different generating processes, which is a result of unexplained population

heterogeneity (Hu et al., 2011). Generally, it is considered that the zeros can be differentiated

into two types: structural zeros, which are generated from a latent class where zero is the only

possible value, and sampling zeros, which arise from a latent class where zero happens within a

random sample of potential count responses. In the case of consumption behavior, in response to

the question “How often did you consume tobacco last month” there will be individuals who

have never smoked before (structural zeros) and individuals who are potential consumers that did

not choose to consume in the last month (sampling zeros). The structural zero observations are

the consumers who have a non-positive desire to consume (which can be categorized as non-

participants), and the sampling zero observations are those consumers who have positive desire

but no positive acquisition in the given period (which can be categorized as potential

consumers).
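The distinction between structural and sampling zeros can be sketched with a small simulation. In the Python snippet below, each individual is a non-participant with probability `p_structural` and otherwise draws a Poisson count; the parameter values, the seed, and the function names are assumptions made for illustration, not the simulation design used later in this chapter.

```python
import math
import random

def draw_poisson(rng, lam):
    """Draw one Poisson variate by Knuth's multiplication method (small lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate(n, p_structural, lam, seed=0):
    """Tally non-participants, potential consumers, and active consumers."""
    rng = random.Random(seed)
    tally = {"structural_zero": 0, "sampling_zero": 0, "positive": 0}
    for _ in range(n):
        if rng.random() < p_structural:
            tally["structural_zero"] += 1      # zero is the only possible value
        elif draw_poisson(rng, lam) == 0:
            tally["sampling_zero"] += 1        # could consume, but did not
        else:
            tally["positive"] += 1
    return tally

# Hypothetical settings: 30% non-participants, mean count 2 among the rest.
print(simulate(20000, p_structural=0.3, lam=2.0, seed=1))
```

In observed data both kinds of zero are recorded identically as 0; only the simulation knows which latent class produced each one.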

To deal with the issues of over-dispersion and excessive zeros, different models have
been used. Generally, when over-dispersion is the only issue, the Negative Binomial model will
be a better fit than Poisson regression. If only zero-inflation exists, either the Zero-inflated
Poisson model or the Hurdle Poisson model is used. If both exist, then the Zero-inflated
Negative Binomial and Hurdle Negative Binomial models could be used.

Negative Binomial Regression Models and Applications

When the data has the issue of over-dispersion, the Negative Binomial model is usually

considered as an alternative to the Poisson regression model, since it provides an extra parameter

to accommodate the additional variability. The Negative Binomial distribution (NB) is a gamma

mixture of the Poisson distribution. In other words, a random non-negative integer is considered

distributed as the Poisson distribution with a mean of λ, where λ is a random variable with a

gamma distribution. Thus, the NB works to allow more flexibility in accommodating variability.

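For example, for the gamma distribution with shape parameter $\gamma$ and scale parameter
$\theta = \frac{\rho}{1-\rho}$, the mass function of the Negative Binomial distribution given the
gamma-Poisson mixture is written as Equation 2-4:

$$
\begin{aligned}
f(y; \gamma, \rho) &= \int_0^\infty f_{\text{Poisson}}(y; \lambda)\, f_{\text{gamma}(\gamma, \rho)}(\lambda)\, d\lambda \\
&= \int_0^\infty \frac{\lambda^{y}}{y!} e^{-\lambda} \cdot \frac{\lambda^{\gamma-1} e^{-\lambda \frac{1-\rho}{\rho}}}{\left(\frac{\rho}{1-\rho}\right)^{\gamma} \Gamma(\gamma)}\, d\lambda \\
&= \frac{\Gamma(\gamma+y)}{y!\,\Gamma(\gamma)}\, \rho^{y} (1-\rho)^{\gamma}
\end{aligned}
\tag{2-4}
$$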
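The gamma-Poisson mixture in Equation 2-4 can be checked numerically. The sketch below integrates the product of a Poisson probability and a gamma density with a midpoint rule and compares the result with the closed form; the shape value 2.0, the value ρ = 0.6, and the integration grid are arbitrary assumptions for illustration.

```python
import math

def poisson_pmf(y, lam):
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def gamma_pdf(lam, shape, scale):
    """Gamma density with the given shape and scale, evaluated at lam > 0."""
    return math.exp((shape - 1) * math.log(lam) - lam / scale
                    - shape * math.log(scale) - math.lgamma(shape))

def nb_by_integration(y, shape, rho, upper=60.0, steps=60000):
    """Midpoint-rule approximation of the integral in Equation 2-4."""
    scale = rho / (1 - rho)
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h          # midpoints avoid evaluating at lam = 0
        total += poisson_pmf(y, lam) * gamma_pdf(lam, shape, scale)
    return total * h

def nb_closed_form(y, shape, rho):
    """Last line of Equation 2-4."""
    log_p = (math.lgamma(shape + y) - math.lgamma(y + 1) - math.lgamma(shape)
             + y * math.log(rho) + shape * math.log(1 - rho))
    return math.exp(log_p)

for y in (0, 1, 3):
    print(y, nb_by_integration(y, 2.0, 0.6), nb_closed_form(y, 2.0, 0.6))
```

The simple midpoint rule is adequate here because a shape of at least 1 keeps the integrand bounded near zero; a smaller shape would call for a more careful quadrature.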

The standard formulation for the Negative Binomial mass function of a variable $Y$ is
given in Equation 2-5:

$$f(y; k, \mu) = \frac{\Gamma(y+k)}{\Gamma(y+1)\,\Gamma(k)} \left(\frac{k}{k+\mu}\right)^{k} \left(1 - \frac{k}{k+\mu}\right)^{y} \tag{2-5}$$

where $E(Y) = \mu$, $Var(Y) = \mu + \frac{\mu^2}{k}$, $\frac{1}{k}$ is defined as the dispersion
parameter, and $k$ corresponds to the gamma shape parameter $\gamma$ in Equation 2-4. As $k$
increases to infinity, $Var(Y)$ decreases to $\mu$, which is equal to $E(Y)$, and the Negative
Binomial distribution approaches the Poisson distribution.
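A short numerical check of Equation 2-5 is sketched below: it evaluates the Negative Binomial mass function, recovers the stated mean and variance by summation, and shows the probabilities approaching the Poisson case when k is large. The parameter values are hypothetical.

```python
import math

def nb_pmf(y, k, mu):
    """Negative Binomial probability in the (k, mu) form of Equation 2-5."""
    q = k / (k + mu)
    log_p = (math.lgamma(y + k) - math.lgamma(y + 1) - math.lgamma(k)
             + k * math.log(q) + y * math.log(1 - q))
    return math.exp(log_p)

def poisson_pmf(y, mu):
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def moments(pmf, upper=2000):
    """Mean and variance obtained by summing over y = 0, ..., upper - 1."""
    mean = sum(y * pmf(y) for y in range(upper))
    var = sum((y - mean) ** 2 * pmf(y) for y in range(upper))
    return mean, var

# Hypothetical parameters: mean 3 and k = 2, so the variance should be 3 + 9/2.
print(moments(lambda y: nb_pmf(y, 2.0, 3.0)))
# With a very large k the NB probability is close to the Poisson probability.
print(nb_pmf(2, 1e6, 3.0), poisson_pmf(2, 3.0))
```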

For the Negative Binomial regression model, which is also a specific form of the GLM,
the link function is also the log transformation, as in the Poisson regression model:
$g(\mu_i) = \log(\mu_i) = x_i'\beta$. Furthermore, as mentioned above, the Negative Binomial
distribution converges to the Poisson distribution as $k$ increases to infinity; thus, the
Poisson regression model is nested within the Negative Binomial regression model. As a result,
the Likelihood Ratio Test or Wald test can be used to test whether the dispersion parameter is
significant.

Since Negative Binomial regression models are more flexible than Poisson regression
models in accommodating data with more variability, they have also been widely used in the
analysis of consumption behavior. For example, Lesser et al. (2013) employ the Negative
Binomial model to test the association between outdoor food advertising and obesity. When
analyzing the data, the authors reject the Poisson model because of the existence of
over-dispersion.

Han and Powell (2013) also employed the Negative Binomial model to analyze consumption

patterns of sugar-sweetened beverages in the United States.

However, although the Negative Binomial regression model can accommodate data with
over-dispersion, it still has some limitations, especially when dealing with the problem of
zero-inflation. Previous research indicates that the Negative Binomial regression model does
not fit data with zero-inflation well (Desjardins, 2012; Hu et al., 2011; Lambert, 1992).
Additionally, when different latent classes generate two types of zeros, the Negative Binomial
model cannot capture the different characteristics of the two groups.

In the case of consumption, Negative Binomial models are very restrictive by assuming

that it is the same set of factors that influence both consumers’ decisions on participation and

consumption. Furthermore, both Poisson regression models and Negative Binomial regression
models assume that the characteristics of consumers and non-consumers do not differ
significantly, and thus fail to fully identify different consumer types. To investigate the
different types of zeros (and consumers), a mixture model or a two-part model may improve fit.

Zero-inflated Models and Applications

Zero-inflated models refer to the models that define a mixture of two different

distributions of zeros, and are able to accommodate the issue of excessive zeros in count data.

Zero-inflated models assume that there are different latent classes in the population. Thus, the

zero observations could be generated through two different sources: “sampling” and “structural”

zeros. When applying the Zero-inflated model to the case of food consumption, zero
consumption will be recorded when the consumer is a genuine non-participant (structural zero),
or when the consumer is a potential consumer who chooses zero consumption as the corner
solution of a standard consumer demand problem (sampling zero). Thus, using the Zero-inflated
models will allow us to predict the existence of three different groups: genuine
non-participants, potential consumers, and active consumers with positive consumption.

Zero-inflated models have been developed for different models, including Poisson
regression models (Lambert, 1992), Negative Binomial regression models (Ridout, Hinde, and
Demetrio, 2001), and other models (e.g., geometric models (Mullahy, 1986)).

The Zero-inflated Poisson (ZIP) model was proposed by Lambert in 1992. It assumes a

mixture of two distributions at the point of zero: a Poisson distribution and a binomial

distribution. According to this assumption, it is assumed that with probability p, the only possible

observation is 0 (structural zero), and with probability (1-p), a Poisson random variable is

observed. The probability mass function of a ZIP model is as follows in Equation 2-6:

$$
\Pr(Y = y) =
\begin{cases}
p + (1-p)\exp(-\lambda), & y = 0 \\
(1-p)\dfrac{\lambda^{y} e^{-\lambda}}{y!}, & y > 0
\end{cases}
\tag{2-6}
$$

Thus, from Equation 2-6, zero observations arise from two parts: the structural point mass
component, $p$, and the sampling Poisson component, $(1-p)\exp(-\lambda)$. In the ZIP model,
$E(Y) = \mu = (1-p)\lambda$, and $Var(Y) = \mu + \frac{p}{1-p}\mu^2$.
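The ZIP mass function in Equation 2-6 and its moments can be verified with a few lines of Python. This is a sketch under assumed values p = 0.3 and λ = 2, not an estimation routine.

```python
import math

def zip_pmf(y, p, lam):
    """Zero-inflated Poisson probability (Equation 2-6)."""
    poisson = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    if y == 0:
        return p + (1 - p) * poisson   # structural zero or sampling zero
    return (1 - p) * poisson

def zip_moments(p, lam, upper=200):
    """Mean and variance obtained by direct summation over the support."""
    mean = sum(y * zip_pmf(y, p, lam) for y in range(upper))
    var = sum((y - mean) ** 2 * zip_pmf(y, p, lam) for y in range(upper))
    return mean, var

p, lam = 0.3, 2.0                       # hypothetical parameter values
mean, var = zip_moments(p, lam)
mu = (1 - p) * lam
print(mean, mu)                         # summed mean vs. (1 - p) * lambda
print(var, mu + p / (1 - p) * mu ** 2)  # summed variance vs. the stated formula
```

Because the variance exceeds the mean whenever p > 0, zero-inflation by itself already produces over-dispersion relative to the Poisson model.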

The ZIP model is also a special case of the GLM, with a logit link function for $p$ and a log
link function for $\lambda$, as follows (Equations 2-7 and 2-8):

$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = x'\beta \tag{2-7}$$

$$\log(\lambda) = z'\alpha \tag{2-8}$$

Where 𝑥 are the covariates for the first stage, with 𝛽 as the corresponding estimates; z are

the covariates for the second stage, with 𝛼 as the corresponding estimate. Furthermore, there is

no requirement that x=z.

The Zero-inflated Poisson model has been widely used when dealing with excessive
zeros, and there are many examples in the analysis of consumption as well. For example, Almasi
et al. (2016) employed the ZIP model to analyze the effects of nutritional habits on dental care
among schoolchildren, and Matheson et al. (2012) explored the influence of gender on alcohol
consumption using the ZIP model.

Additionally, Lambert (1992) extends the ZIP model to the ZIP(τ) model, which allows p
and λ to be related through a shape parameter τ. Huang and Chin (2010) employed the ZIP(τ)
model to analyze road traffic crashes, and Calsyn et al. (2009) explored the correlation between
motivational and skills training and HIV risk using the ZIP(τ) model.

Just as the Poisson regression model was extended to the Negative Binomial regression
model, the Zero-inflated Poisson regression model can be extended to the Zero-inflated Negative
Binomial (ZINB) regression model as well. Apart from zero-inflation, over-dispersion may also
arise from greater variability in the non-zero outcomes. In this case, the ZINB model is a
better fit for the data than the ZIP model.

Similar to the ZIP distribution, the ZINB distribution assumes a mixture distribution at the
point of zero: a Negative Binomial distribution and a binomial distribution. Thus, the ZINB can
be expressed as follows in Equation 2-9:

$$
\Pr(Y = y) =
\begin{cases}
p + (1-p)\left(\dfrac{k}{k+\mu}\right)^{k}, & y = 0 \\
(1-p)\dfrac{\Gamma(y+k)}{\Gamma(y+1)\,\Gamma(k)} \left(\dfrac{k}{k+\mu}\right)^{k} \left(1 - \dfrac{k}{k+\mu}\right)^{y}, & y > 0
\end{cases}
\tag{2-9}
$$

where $\mu$ is the mean of the NB distribution and $\frac{1}{k}$ is the dispersion parameter.
Thus, the mean and variance of the ZINB distribution are $E(Y) = (1-p)\mu$ and
$Var(Y) = (1-p)\,\mu\left(1 + \frac{\mu}{k} + p\mu\right)$. Just as the NB distribution converges
to the Poisson distribution as $k$ increases to infinity, the ZINB distribution also converges
to the ZIP distribution as $k$ increases.
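The same kind of check can be sketched for the ZINB distribution in Equation 2-9. The values p = 0.3, k = 2, and μ = 3 below are assumptions for illustration; the code sums the mass function to recover the mean and variance stated above.

```python
import math

def nb_pmf(y, k, mu):
    q = k / (k + mu)
    return math.exp(math.lgamma(y + k) - math.lgamma(y + 1) - math.lgamma(k)
                    + k * math.log(q) + y * math.log(1 - q))

def zinb_pmf(y, p, k, mu):
    """Zero-inflated Negative Binomial probability (Equation 2-9)."""
    if y == 0:
        return p + (1 - p) * nb_pmf(0, k, mu)
    return (1 - p) * nb_pmf(y, k, mu)

def zinb_moments(p, k, mu, upper=2000):
    mean = sum(y * zinb_pmf(y, p, k, mu) for y in range(upper))
    var = sum((y - mean) ** 2 * zinb_pmf(y, p, k, mu) for y in range(upper))
    return mean, var

p, k, mu = 0.3, 2.0, 3.0                        # hypothetical parameter values
mean, var = zinb_moments(p, k, mu)
print(mean, (1 - p) * mu)                       # mean vs. (1 - p) * mu
print(var, (1 - p) * mu * (1 + mu / k + p * mu))  # variance vs. stated formula
```

The two terms inside the variance bracket separate the two sources of dispersion: μ/k comes from the Negative Binomial component and pμ from the zero-inflation.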

There has been much research conducted employing the ZINB model. Examples of the
use of the ZINB model in consumption include Hendrix and Haggard (2015), who employed the
ZINB model to analyze global food prices and regime type in the developing world. Moralies
and Higuchi (2017) employed the ZINB model to explore the effect of consumers’ beliefs
about health and nutrition on their willingness to pay for fish.

Hurdle Model and Applications

The Hurdle model was first developed by Cragg (1971) as an example of truncated

models, relaxing the Tobit model by allowing separate stochastic processes for the observed zero

and positive outcomes (Yen and Huang, 1996). Different from the Zero-inflated models, the

Hurdle models are no longer a mixture of different models, but a two-part model. The first part

predicts whether the outcome is zero or not, and the second part generates the non-zero counts.

Thus, it assumes that all the zeros are from the first stage.

When modeling consumption behavior using the Hurdle count data model, there is an

assumption that individuals need to pass two stages before being observed with a positive level

of consumption: a participation decision and a consumption decision. In the first stage, the

consumer makes a decision on whether or not to participate. In the second stage, a decision on

how much/many to purchase is determined. Specifically, the Hurdle model assumes that the

participation stage dominates the consumption stage. Thus, if the consumers choose to

participate in the first step, it does not allow zero-consumption in the consumption stage.

The Hurdle model uses a binomial logistic regression model to indicate whether a count
is zero or positive (Green, 1994). If a positive outcome is realized, then a zero-truncated
count data model (Poisson/NB) is used for the positive counts. However, the first part does not
have to be a binomial logistic regression model: “there will likely exist numerous plausible
specifications of both the binary probability model and the conditional distribution of the
positives” (Mullahy, 1986). For example, Mullahy (1986) used a Poisson distribution to govern
the probability of observing a zero count. Thus, a generic Hurdle model is as follows in
Equation 2-10:

$$
\Pr(Y = y) =
\begin{cases}
g_1(0), & y = 0 \\
\left(1 - g_1(0)\right)\dfrac{g_2(y)}{1 - g_2(0)}, & y = 1, 2, 3, \ldots
\end{cases}
\tag{2-10}
$$

where $Y$ is the outcome variable; $g_1$ is the specification of the binary probability model
that governs the first hurdle, indicating whether the outcome is zero; and $g_2$ is the
specification of the count distribution that, truncated at zero, generates the positive values.

There are also some popular specifications for $g_1$ and $g_2$. For example, Green (1994)
specified $g_1$ as a binomial distribution and $g_2$ as a truncated-at-zero Poisson
distribution, which gives the form in Equation 2-11:

$$
\Pr(Y = y) =
\begin{cases}
p, & y = 0 \\
(1-p)\dfrac{\lambda^{y} e^{-\lambda}}{\left(1 - e^{-\lambda}\right) y!}, & y = 1, 2, 3, \ldots
\end{cases}
\tag{2-11}
$$

where $p$ is the probability of a count being observed as zero, and $\lambda$ is the parameter
of the truncated Poisson distribution. To be more specific, the link function for $p$ is the
logit transformation, $\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = x'\beta$, and the link
function for $\lambda$ is the log, with $\log(\lambda) = z'\alpha$. Here $x$ are the covariates
for the first stage, with $\beta$ as the corresponding estimates; $z$ are the covariates for the
second stage, with $\alpha$ as the corresponding estimates. Furthermore, there is no
requirement that $x = z$.
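The Hurdle Poisson specification in Equation 2-11 can be sketched as follows. The snippet uses assumed values p = 0.4 and λ = 2 and checks that the probabilities sum to one, that the zero probability is exactly p, and that the mean equals (1 − p)λ/(1 − e^{−λ}), which follows from the mean of a zero-truncated Poisson variable.

```python
import math

def hurdle_poisson_pmf(y, p, lam):
    """Hurdle Poisson probability (Equation 2-11)."""
    if y == 0:
        return p                                  # all zeros come from the hurdle
    poisson = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    return (1 - p) * poisson / (1 - math.exp(-lam))  # zero-truncated Poisson part

p, lam = 0.4, 2.0                                  # hypothetical parameter values
support = range(200)
total = sum(hurdle_poisson_pmf(y, p, lam) for y in support)
mean = sum(y * hurdle_poisson_pmf(y, p, lam) for y in support)
print(total)
print(mean, (1 - p) * lam / (1 - math.exp(-lam)))
```

Unlike the ZIP model, the zero probability here is a free parameter, so it can be set below e^{−λ}; this is why a hurdle specification can also represent data with fewer zeros than a Poisson model would predict.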

In another example, Mullahy (1986) specified both $g_1$ and $g_2$ as Poisson distributions,
which gives the specification in Equation 2-12:

$$
\Pr(Y = y) =
\begin{cases}
e^{-\lambda_1}, & y = 0 \\
\left(1 - e^{-\lambda_1}\right)\dfrac{\lambda_2^{y} e^{-\lambda_2}}{\left(1 - e^{-\lambda_2}\right) y!}, & y = 1, 2, 3, \ldots
\end{cases}
\tag{2-12}
$$

where $\lambda_1$ is the parameter of the Poisson distribution governing the first hurdle, and
$\lambda_2$ is the parameter of the Poisson distribution generating the positives. Both
$\lambda_1$ and $\lambda_2$ could be parameterized with a log link function as
$\log(\lambda_1) = x'\beta$ and $\log(\lambda_2) = z'\alpha$.

Shonkwiler and Shaw (1996) extended Mullahy’s specification by allowing zero
observations in both the first and second stages. Thus, in Shonkwiler and Shaw’s model (the
Double Hurdle count data model¹), there are two mechanisms generating zero observations: zero
observations could arise either in the first stage, by choosing not to consume, or in the second
stage, by choosing a consumption frequency of zero. The essence of the double Hurdle count data
model is very similar to the ZIP model, but with the first part indicating the structural zero
using a Poisson distribution specification instead of a binomial. The specification for the
double Hurdle count data model is as follows in Equation 2-13:

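$$
\begin{aligned}
\Pr(Y = y) &=
\begin{cases}
e^{-\lambda_1} + \left(1 - e^{-\lambda_1}\right) e^{-\lambda_2}, & y = 0 \\
\left(1 - e^{-\lambda_1}\right)\left(1 - e^{-\lambda_2}\right)\dfrac{\lambda_2^{y} e^{-\lambda_2}}{\left(1 - e^{-\lambda_2}\right) y!}, & y = 1, 2, 3, \ldots
\end{cases} \\
&=
\begin{cases}
e^{-\lambda_1} + \left(1 - e^{-\lambda_1}\right) e^{-\lambda_2}, & y = 0 \\
\left(1 - e^{-\lambda_1}\right)\dfrac{\lambda_2^{y} e^{-\lambda_2}}{y!}, & y = 1, 2, 3, \ldots
\end{cases}
\end{aligned}
\tag{2-13}
$$

where $\lambda_1$ is the parameter of the Poisson distribution governing the first part,
indicating whether the zeros are structural zeros or not, and $\lambda_2$ is the parameter of
the Poisson distribution for the second part. Both $\lambda_1$ and $\lambda_2$ could be
parameterized with a log link function as $\log(\lambda_1) = x'\beta$ and
$\log(\lambda_2) = z'\alpha$. If we let $p = e^{-\lambda_1}$, then this model specification is
the same as the ZIP model.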
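The equivalence between the double Hurdle count data model in Equation 2-13 and the ZIP model in Equation 2-6 can be confirmed numerically. The sketch below uses hypothetical values for λ1 and λ2 and compares the two mass functions after setting p = e^{−λ1}.

```python
import math

def poisson_pmf(y, lam):
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def double_hurdle_pmf(y, lam1, lam2):
    """Second (simplified) form of Equation 2-13."""
    participate = 1 - math.exp(-lam1)
    if y == 0:
        return math.exp(-lam1) + participate * math.exp(-lam2)
    return participate * poisson_pmf(y, lam2)

def zip_pmf(y, p, lam):
    """Equation 2-6."""
    if y == 0:
        return p + (1 - p) * math.exp(-lam)
    return (1 - p) * poisson_pmf(y, lam)

lam1, lam2 = 1.2, 2.5                    # hypothetical parameter values
p = math.exp(-lam1)                      # the substitution stated in the text
largest_gap = max(abs(double_hurdle_pmf(y, lam1, lam2) - zip_pmf(y, p, lam2))
                  for y in range(50))
print(largest_gap)
```

The two models therefore differ only in how the structural-zero probability is linked to covariates: through log(λ1) in the double Hurdle model and through logit(p) in the ZIP model.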

1 The term is borrowed from Shonkwiler and Shaw (1996).

Just as the Poisson regression model can be extended to the NB regression model and the ZIP
model to the ZINB model, the Hurdle Poisson regression model can be extended to the Hurdle NB
model. There have been many studies using Hurdle models (Hu et al., 2001; Bandyopadhyay et
al., 2011; Bethell et al., 2010; Rose et al., 2006). Examples of research using Hurdle models
include Crowley, Eakins and Jordan (2012), who employed the double-Hurdle model to analyze
lottery participation and expenditure; Jaunky and Ramchurn (2014), who analyzed consumer
behavior in the scratch card market; Jiang et al. (2012), who modeled mushroom consumption;
and Bezu and Kassie (2014), who estimated maize planting decisions.

Comparison of the Models

In this section, six count data models have been described: the Poisson, NB, ZIP,
ZINB, Hurdle Poisson (PH), and Hurdle NB (NBH) models. When models are nested within one
another, a Wald/LR test can be used to test the significance of the extra parameters. For
example, the Poisson regression model is nested within the NB model, the ZIP within the ZINB,
and the PH within the NBH. Other pairs of models are not nested within each other. If models
are not nested, they can be compared using Vuong’s test (Vuong, 1989), the Akaike Information
Criterion (AIC), and the Bayesian Information Criterion (BIC).
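The information criteria and the likelihood ratio statistic mentioned above are simple functions of the maximized log-likelihood. The sketch below uses the standard definitions AIC = 2q − 2 ln L and BIC = q ln n − 2 ln L, where q is the number of estimated parameters and n the sample size; the log-likelihood values, parameter counts, and sample size are hypothetical and do not come from any fitted model in this dissertation.

```python
import math

def aic(loglik, n_params):
    """Akaike Information Criterion; smaller values indicate better fit."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion; penalizes parameters more as n grows."""
    return n_params * math.log(n_obs) - 2 * loglik

def lr_statistic(loglik_restricted, loglik_full):
    """Likelihood ratio statistic for a pair of nested models."""
    return 2 * (loglik_full - loglik_restricted)

# Hypothetical (log-likelihood, number of parameters) pairs, for illustration.
fits = {"Poisson": (-1250.0, 3), "NB": (-1100.0, 4),
        "ZIP": (-1120.0, 6), "ZINB": (-1080.0, 7)}
n_obs = 500

for name, (ll, q) in fits.items():
    print(name, round(aic(ll, q), 1), round(bic(ll, q, n_obs), 1))
print("LR statistic, Poisson vs. NB:",
      lr_statistic(fits["Poisson"][0], fits["NB"][0]))
```

AIC and BIC apply to nested and non-nested pairs alike, whereas the LR statistic is meaningful only for nested pairs such as the Poisson and NB models.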

Prior research has compared some of these models. For the NB and Poisson regression models,
research, as mentioned above, showed that the NB better handles the problem of over-dispersion
(Atkins and Gallops, 2007; Warton, 2005). When over-dispersion is not present, according to
Warton (2005), Poisson regression models perform better than NB models.

As for the comparisons between NB and Zero-inflated models, Lambert (1992) compared
the ZIP model to the Poisson and NB models when the ZIP model was first proposed. The
conclusion was that the ZIP outperformed the other two models, and the NB performed better than
the Poisson model in terms of prediction. Green (1994) compared the NB, ZIP, and ZINB
models. Based on Vuong’s test statistics, he found that the ZINB model performed the best,
followed by the NB, ZIP, and Poisson models. A possible reason for this result is that the ZINB
model can accommodate two sources of dispersion, and in the data used in that study, the
dispersion was caused mostly by unobserved response heterogeneity, which would lead the NB
model to perform better than the ZIP model. Desouhant et al. (1998) compared the NB and ZIP
models and found that the two models performed roughly similarly. They conclude that
researchers need to accommodate both over-dispersion and zero-inflation in the analysis of count
data. Slymen et al. (2006) compared the ZIP, ZINB, NB, and Poisson models and found that the NB
model fit better than the Poisson model. However, the ZINB and ZIP models performed nearly
the same regarding both model fit and parameter estimates, which indicates that the main issue
in the data in that study was zero-inflation and that dispersion was likely not severe.
Wenger and Freeman (2008) compared the ZIP, ZINB, NB, and Poisson models and concluded that
the Zero-inflated models perform better than the non-inflated models and that models with the
NB formulation fit better than those without it.

The comparisons between Zero-inflated models and Hurdle models has attrated more

attention. The focus of comparison of these two models has been the sources of zero

observations. As discussed above, the Hurdle models assume there only exists one type of zero

observations, yet the Zero-inflated models assume that zero observations are coming from two

different sources. A second difference between these two models is their capability of handling

data with zero-deflation. Zero-inflated models are typically used to analyze data with zero-

inflation and have a poor fit for data with under-dispersion of zero counts, while Hurdle models

have better fit dealing with zero-deflation. Min and Agresti (2005) compared the ZIP model with

the Hurdle Poisson model using a simulation study, and found that the ZIP model estimated very

poorly when the data had zero-deflation, while the Hurdle model did not. Based on

this study, the Hurdle models might be more general than the Zero-inflated

models. Desjardins (2013) compared the ZINB and HNB models using simulations and found

that the HNB performs better than the ZINB regarding both model fit and parameter recovery.

Gaps and Shortcomings

Although there has been much research comparing models, there have been few studies

comparing and evaluating model performance using simulation. Beyond the studies by Lambert

(1992), Min and Agresti (2005), Miller (2007) and Desjardins (2013), no one has compared the

performance of the count data models using simulated data instead of empirical data.

Furthermore, no one has compared all six count data models in a

simulation study given different levels of zero proportion and over-dispersion. Next, when

comparing models, previous studies focused mainly on model fit, without consideration of

their capability of prediction. In particular, when comparing the Hurdle models with Zero-inflated

models, their capability to capture zero observations was rarely considered.

Thus, how the different count data models perform given different levels of zero

proportion and dispersion is an area that needs further investigation. From the previous empirical

research, it can be inferred that if the effect of dispersion is much larger than the effect of zero-

inflation, the NB model should perform better than the ZIP model. The question remains though,

how does the effect of zero-inflation change given different levels of dispersion? How would the

model performance change based on different levels of zero proportion given different levels of

dispersion?

Furthermore, when analyzing consumption behavior, it is of great interest to analyze

different consumer types and explore market segmentation. In this sense, in addition to the

model fit, it is also very important to compare the models’ prediction capabilities. To my

knowledge, there has been no prior research focusing on the comparison of models’ prediction

capabilities, especially when comparing Zero-inflated models to Hurdle models, which have very

different approaches to classifying zeros. An interesting point would be to determine if the Zero-

inflated model could efficiently predict the correct portion of non-consumers (structural zeros),

which is the most important utility of the Zero-inflated models. With one more step, if we allow

different levels of (structural) zeros in the Zero-inflated models, how would the performance of

the two models change?

Method

Research Questions

Based on the literature review, two research questions will be examined in this study.

First, under different simulation conditions, including zero proportion and dispersion rate, how

will the six count data models (Poisson model, Zero-inflated Poisson model, Hurdle Poisson

model, and their Negative Binomial variations) perform regarding model fit? Second, we will

explore the consequences of misspecifying the distributions. In particular, we will pay special

attention to the comparison between Zero-inflated Models and Hurdle models, and evaluate the

proportion of correctly identified structural zeros for the Zero-inflated Models, and test the

consequences, if any, of misspecifying the latent classes for the zeros given different levels of

structural zeros.

To answer these research questions, a simulation study was conducted under different

scenarios. We will generate datasets from four different distributions: Zero-inflated Poisson

distribution (𝜌, 𝜇) (where 𝜌 is the proportion of structural zeros, and 𝜇 is the mean of the

Poisson); Hurdle Poisson (𝜋, γ); Zero-inflated Negative Binomial distribution (𝜌, 𝜇, 𝑘) (where 𝜌

is the proportion of structural zeros, 𝜇 is the mean of the count process, and k is the dispersion rate);

and Hurdle Negative Binomial distribution (𝜋, γ, k). After generation, we will fit each dataset

with each of the six different count-data models to compare their performances under different

true and untrue model specifications.

The simulation conditions controlled in this experiment include different levels of zero

(structural zero) proportion, and various levels of dispersion rate. To be more specific, the

zero/structural zero percentages (𝜌) and the levels of dispersion (k) will each be set at

several levels, and we will compare each model’s capabilities of capturing the

zero observations and structural zero observations. With the purpose of evaluating model

performances, the model fit, prediction bias, and proportion of correctly identified structural and

sampling zeros will be recorded and compared. What is more, the consequences of fitting a

model to a mis-specified distribution will be evaluated with special attention.

Monte Carlo Simulation

As discussed in the previous section, the generalized linear model was constructed by a

systematic component, a random component and a link function. The case model for the Poisson

regression assumes that 𝑦1, 𝑦2…𝑦𝑛 are independently, identically distributed as follows:

$$Y_i \sim \text{Poisson}(\theta_i)$$

where the link function is:

$$\log(\theta_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} \qquad \text{(2-14)}$$

Similarly, the negative binomial formulation of the Poisson model is the same as the

Poisson regression but with an extra parameter of dispersion.

The case model for zero-inflated Poisson model assumes that 𝑦1, 𝑦2…𝑦𝑛 are

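independently, identically distributed as follows:

$$Y_i \sim \text{ZIP}(p_i, \theta_i)$$

$$\text{logit}(p_i) = \log\left(\frac{p_i}{1-p_i}\right) = \alpha_0 + \alpha_1 z_{1i} + \alpha_2 z_{2i} \qquad \text{(2-15)}$$

$$\log(\theta_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} \qquad \text{(2-16)}$$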

The zero-inflated negative binomial model is similar to the ZIP model, with an extra

dispersion parameter: $Y_i \sim \text{ZINB}(p_i, \theta_i, k^{-1})$. The link function of the ZINB model is the same as the

ZIP model.

The last set of models is the Hurdle models. The Hurdle Poisson regression model

assumes that 𝑦1, 𝑦2…𝑦𝑛 are independent and identically distributed as the distribution:

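$$Y_i \sim \text{HP}(\pi_i, \theta_i)$$

$$\text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \alpha_0 + \alpha_1 z_{1i} + \alpha_2 z_{2i} \qquad \text{(2-17)}$$

$$\log(\theta_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} \qquad \text{(2-18)}$$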

Similarly, the Hurdle Negative Binomial model has the same link function as the HP

model, but has one additional parameter for dispersion.
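The difference between the two ways of modeling zeros can be illustrated with a minimal Python sketch. It rests on assumptions that the text above does not state: 𝜋 is read as the probability of a zero in the Hurdle model, 𝜃 is the mean parameter of the untruncated Poisson, and all function names are hypothetical.

```python
import math

def zip_zero_prob(p, theta):
    """P(Y = 0) under ZIP: structural zeros plus Poisson sampling zeros."""
    return p + (1.0 - p) * math.exp(-theta)

def zip_structural_share(p, theta):
    """Share of all zeros that are structural under ZIP."""
    return p / zip_zero_prob(p, theta)

def hurdle_poisson_pmf(y, pi, theta):
    """Hurdle Poisson pmf, assuming pi = P(Y = 0) (an assumption of this sketch).

    Zeros come only from the binary part; positive counts follow a
    zero-truncated Poisson with untruncated mean parameter theta.
    """
    if y == 0:
        return pi
    poisson_pmf = math.exp(-theta) * theta ** y / math.factorial(y)
    return (1.0 - pi) * poisson_pmf / (1.0 - math.exp(-theta))
```

Under these assumptions, a ZIP distribution with p = 0 collapses to an ordinary Poisson, whereas the Hurdle model fixes the zero probability at 𝜋 regardless of 𝜃.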

Simulation Study Design

The simulation study was designed to examine the performance of the six count data

models under different sets of simulation conditions, and in which conditions these models have

similar or dissimilar performance. In particular, in this experiment, we will evaluate the

performance of the models regarding model fit and capabilities of predicting zero observations.

What is more, the experiment is also designed to explore the consequences of fitting a wrong

model to a pre-specified distribution, and specifically, this experiment will give attention to the

consequences of fitting a Zero-inflated model to a Hurdle-model distribution and vice-versa.

To be more specific, in this experiment, datasets will be generated from the following

four distributions: Zero-inflated Poisson, Zero-inflated Negative Binomial, Hurdle Poisson, and

Hurdle Negative Binomial. For each distribution, the zero/structural zero proportion is generated

from a binomial process controlled by different levels of p values, and the counting process

(Poisson/Negative Binomial) will be set given known coefficients. In particular, if the parameter

p (proportion of structural zeros) in ZIP is 0, then the distribution would be a Poisson/truncated

Poisson distribution. Similarly, if the parameter p in ZINB is 0, then the distribution would be a

Negative Binomial/Truncated Negative Binomial distribution. What is more, for the Negative

Binomial formulations, the experiment will also control different levels of dispersion to compare

model performances under different situations. Once the dataset is generated, six different count

data models will be fit for each dataset, and we will compare model performances based on the

model fit and capability to capture zero and structural zeros. We will also evaluate the capability

of coefficient recovery (for the counting process), and relative/absolute bias.

Regarding simulation conditions, the following scenarios will be considered in this study:

distributions, model types, levels of dispersion, and varying levels of zero/structural zero

proportion. In total, there were 4 distributions (ZIP, ZINB, HP and HNB), 6 models (Poisson,

NB, ZIP, ZINB, HP, HNB), 4 levels of dispersion (0.5, 1, 2, 4) and 5 levels of (structural) zero

proportion (0.1, 0.3, 0.5, 0.7, 0.9). Here, the level of zero proportion indicates the total zero

proportion for Hurdle model distributions and the structural-zero proportion for Zero-inflated

model distributions. What is more, the dispersion level will only exist for Negative Binomial

formulations. In total, this results in 4*5*(3+3*4)=300 different scenarios.
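One way to arrive at the total of 300 is to enumerate the design directly. The sketch below is only an illustration of that count: it assumes the dispersion levels apply to the two Negative Binomial data-generating distributions only, and all variable names are hypothetical.

```python
from itertools import product

poisson_dists = ["ZIP", "HP"]        # no dispersion level
nb_dists = ["ZINB", "HNB"]           # crossed with the dispersion levels
models = ["Poisson", "NB", "ZIP", "ZINB", "HP", "HNB"]
zero_levels = [0.1, 0.3, 0.5, 0.7, 0.9]
dispersion_levels = [0.5, 1, 2, 4]

# Data-generating conditions: (distribution, zero proportion, dispersion or None)
conditions = [(d, z, None) for d, z in product(poisson_dists, zero_levels)]
conditions += list(product(nb_dists, zero_levels, dispersion_levels))

# Every fitted model is crossed with every data-generating condition.
scenarios = [(cond, m) for cond in conditions for m in models]

print(len(conditions), len(scenarios))
```

Read this way, there are 50 data-generating conditions, and fitting six models to each gives the 300 scenarios.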

Sample size is another important concern when analyzing different models. When the

sample size is too small, results may not be consistent since it is not valid to assume

asymptotic normality. However, when the sample size is too large, the computation time is

significantly increased. Based on previous research, Lambert (1992) considered sample sizes of

25, 50, and 100, but singularities and non-convergence occurred in the experiment. Particularly,

when the sample size is small (like 25 or 50), the situation of near perfect discrimination (when

there is a hyperplane that divides all 0s on one side and all the 1s on the other side) is more likely

to happen. Thus, to avoid these issues, the sample size in the experiment was set to 250 in all

cases.

The simulation size is also important considering the validation of simulation results. If

there are too few replications, results may not be consistent, and as the size of simulation is

increased, the consistency of results would also increase. In this study, the simulation size is set

to be 1,000, similar to Civettini and Hines’s (2005) research which analyzed model

misspecification with the ZINB model.

The “glm” procedure in R was used for the Poisson and Negative Binomial Regression

analysis, the “pscl” library in R (Zeileis et al., 2008) was used for the Zero-inflated Poisson

model, Zero-inflated Negative Binomial model, and Hurdle Poisson model analysis, and the

“actuar” library in R (Dutang et al., 2008) was used for the Hurdle Negative Binomial model

analysis. For each model regressed on each dataset, results including Loglikelihood and AIC

statistics, relative bias of E(Y|X) and predictions of structural zero/zero are saved for the further

analysis.

Model Evaluation of the Simulation Studies

To assess the performance of the six different models under different scenarios, we will

employ various measures related to model fit and their capabilities of predicting zero

observations. Specifically, for model fit, we will employ the Loglikelihood statistics, AIC

statistics, and the relative bias for the E (Y|X).

The AIC statistic is defined as 2k-2log(L), where k is the number of parameters in the

model, and L is the likelihood of the model. In general, models with the lowest AIC are favored.

Because the six count models are not all nested, likelihood-ratio tests cannot be used to

compare all of them. Thus, the AIC statistics are utilized in this analysis.
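As a small illustration of the AIC definition, the following Python sketch compares two fits. The log-likelihood values and parameter counts are hypothetical and chosen only to show how the penalty term can reverse a ranking based on the log-likelihood alone.

```python
def aic(log_likelihood, n_params):
    """AIC = 2k - 2 log(L), with k the number of estimated parameters."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical values: the larger model fits slightly better in log-likelihood
# but pays for one extra (dispersion) parameter.
aic_zip = aic(log_likelihood=-400.0, n_params=6)
aic_zinb = aic(log_likelihood=-399.5, n_params=7)

print(aic_zip, aic_zinb)
```

In this made-up case the smaller model is preferred by AIC even though the larger model has the higher log-likelihood.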

Besides the Loglikelihood and AIC statistics, similar to what Lambert (1992) did in her

study, we will also employ the relative bias for the E(Y|X) to examine the model fit. The relative

bias for the E(Y|X) is defined as the absolute value of $[E(\hat{Y}|X) - E(Y|X)]/E(Y|X)$. For each

simulation, the average of the relative bias was calculated, and then aggregated across all the

simulations. In particular, in this analysis, in order to evaluate the prediction more accurately, the

10-fold cross-validation was employed.
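The relative bias defined above can be sketched in Python as follows. This is a minimal illustration for a single simulated dataset, assuming the true conditional means are known from the data-generating process. The function name is hypothetical, and the 10-fold cross-validation step is not reproduced here.

```python
import numpy as np

def mean_relative_bias(predicted_mean, true_mean):
    """Average of |E(Y_hat|X) - E(Y|X)| / E(Y|X) over the observations."""
    predicted = np.asarray(predicted_mean, dtype=float)
    truth = np.asarray(true_mean, dtype=float)
    return float(np.mean(np.abs(predicted - truth) / truth))

# Toy values: each prediction is 50% above the true conditional mean.
print(mean_relative_bias([1.5, 3.0], [1.0, 2.0]))
```

Averaging this quantity across replications would then give the aggregated figure described in the text.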

As for the capabilities of predicting zero observations, the models’ predictions of zero and

structural zero will be saved in each scenario. The Zero-inflated models are able to

predict both the structural and sampling zero observations, thus results will be recorded for their

predictions of both types of zero. Considering the models’ attributes, the Hurdle models are

only able to predict one type of zero, thus results will be recorded only for their predictions

for the zero observations. What is more, to better assess the prediction of zero, the 10-fold cross

validation method was employed in this analysis.

Since we generate datasets from four different distributions, and for each distribution we

run all six count data models, we give special attention to the performances

of the models as a consequence of using a wrong model given a true distribution. In

particular, we are very interested in exploring the capabilities of Hurdle models to capture

zeros when the true model is a Zero-inflated model, and vice versa.

Data Generating Mechanism

Data for ZIP/ZINB were generated using a similar procedure to what was employed in

Lambert (1992), but the proportion of structural zero p was controlled in this experiment. The

data-generating mechanism is as follows:

Calculate 𝜃𝑖 based on the specifications of the coefficients of count process.

Generate a Uniform (0,1) random vector U of length n

If 𝑈𝑖 ≤ p (the given level of structural zero), then 𝑦𝑖=0; otherwise 𝑦𝑖 ~ Poisson(𝜃𝑖) or 𝑦𝑖 ~

NB(𝜃𝑖, 𝑘) (k is the given level of dispersion).
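The three steps above can be sketched in Python with NumPy. This is an illustrative sketch rather than the code used in the study: the function name and the coefficient values are hypothetical, and the Negative Binomial is assumed to have mean 𝜃 and variance 𝜃 + k𝜃², which may differ from the parameterization actually used.

```python
import numpy as np

def simulate_zero_inflated(x, beta, p, k=None, rng=None):
    """Draw ZIP (k is None) or ZINB (k given) counts; x includes an intercept column."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.exp(x @ beta)                    # step 1: log link for the count mean
    u = rng.uniform(size=theta.shape[0])        # step 2: Uniform(0, 1) vector
    structural = u <= p                         # step 3: structural zeros
    if k is None:
        counts = rng.poisson(theta)
    else:
        size = 1.0 / k                          # assumed: Var = theta + k * theta**2
        counts = rng.negative_binomial(size, size / (size + theta))
    return np.where(structural, 0, counts), structural

rng = np.random.default_rng(2018)
x = np.column_stack([np.ones(250), rng.normal(size=250), rng.normal(size=250)])
beta = np.array([0.5, 0.3, -0.2])               # hypothetical coefficients
y_zip, s_zip = simulate_zero_inflated(x, beta, p=0.3, rng=rng)
y_zinb, s_zinb = simulate_zero_inflated(x, beta, p=0.3, k=2.0, rng=rng)
```

Observations that are not flagged as structural can still be zero. These are the sampling zeros produced by the count process itself.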

The data-generating mechanism for the Hurdle Poisson/Hurdle NB distribution is as

follows:

Calculate 𝜃𝑖 based on the specifications of the coefficients of count process.

Generate a Uniform (0,1) random vector U of length n

If 𝑈𝑖 ≤ p (the given level of zero proportion), then 𝑦𝑖=0; otherwise 𝑦𝑖 ~ Zero-Truncated

Poisson(𝜃𝑖) or 𝑦𝑖 ~ Zero-Truncated NB(𝜃𝑖, 𝑘) (k is the given level of dispersion).
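The Hurdle mechanism can be sketched in the same way. The sketch below is an assumption-laden illustration: zero-truncated draws are obtained by simple rejection (redrawing any zero), the Negative Binomial is again assumed to have variance 𝜃 + k𝜃², and the function name and coefficients are hypothetical.

```python
import numpy as np

def simulate_hurdle(x, beta, p, k=None, rng=None):
    """Draw Hurdle Poisson (k is None) or Hurdle NB counts; p is the total zero proportion."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.exp(x @ beta)

    def draw(mu):
        if k is None:
            return rng.poisson(mu)
        size = 1.0 / k                          # assumed: Var = mu + k * mu**2
        return rng.negative_binomial(size, size / (size + mu))

    # Zero-truncated draws by rejection: redraw every zero until none remain.
    positive = draw(theta)
    zero_idx = np.flatnonzero(positive == 0)
    while zero_idx.size > 0:
        positive[zero_idx] = draw(theta[zero_idx])
        zero_idx = zero_idx[positive[zero_idx] == 0]

    is_zero = rng.uniform(size=theta.shape[0]) <= p
    return np.where(is_zero, 0, positive), is_zero

rng = np.random.default_rng(2018)
x = np.column_stack([np.ones(250), rng.normal(size=250), rng.normal(size=250)])
beta = np.array([0.5, 0.3, -0.2])               # hypothetical coefficients
y_hp, z_hp = simulate_hurdle(x, beta, p=0.5, rng=rng)
y_hnb, z_hnb = simulate_hurdle(x, beta, p=0.5, k=1.0, rng=rng)
```

Here every zero comes from the binary step, so the number of zeros equals the number of observations with 𝑈𝑖 ≤ p.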

Results

Pseudo-Population Zero-inflated Poisson Model

At first, we focus on the results when the pseudo-population is Zero-inflated Poisson Model.

The table of convergence rate under the five different levels of structural zero is displayed below

in Table 2-1. As the zero proportion gets larger, the convergence rate gets

smaller. When the structural zero proportion gets to be as high as 90%, the convergence rate is

only 86.1%.

Model fit

The Loglikelihood statistics for the six different models at different levels of zero

proportion when the true model is ZIP are displayed in Table 2-2. The AIC statistics for the

different models at different levels of zero proportion when the true model is ZIP are displayed

in Table 2-3. When the proportion of structural zeros is 10% or 30%, the Zero-inflated Negative

Binomial (ZINB) regression model has the lowest LL rather than the true ZIP model. However,

in terms of AIC the true ZIP model has the best fit (because the ZINB model has one extra parameter (dispersion

parameter)). What is more, we also find that when the zero proportion gets larger (50%, 70% and

90%), the true ZIP model has the lowest LL. When the proportion of zeros is less than 90%, the

ZINB model is the best among the five alternative models. When the zero percentage increases

to 90%, the HP model is the best alternative model given both the LL and AIC statistics.

Another finding is that as the zero proportion increases, the model fit for the Zero-inflated

and Hurdle models improves, yet the model fit of Poisson and Negative Binomial regressions gets

worse. This is an indicator that the Poisson and Negative Binomial models struggle to handle the

issue of zero-inflation.

Relative bias of E(Y|X)

The relative bias for E(Y|X) in the case when the true distribution is Zero-inflated Poisson

distribution is displayed in Table 2-4. Similar to the model fit statistics, we observe that when the

true model is ZIP, the ZINB model performs well in predicting Y. When the percentage of

structural zeros is smaller than 50%, the true ZIP model has the best performance in predicting

Y, yet when the percentage of structural zeros is greater than 50%, the mis-specified ZINB

model is better than the true model. Across all the different levels of zero proportion, the average

of the relative bias of the ZIP model is -0.43, and the average of relative bias of the ZINB model

is less, at -0.35. The Poisson and Negative Binomial (NB) models have significant bias in

predicting Y (relative value larger than 1). What is more, the Hurdle models are the worst models

regarding the relative bias in this case (-1.77).

When comparing the results row by row, as the proportion of structural zeros increases, the

relative bias of the true model ZIP increases, and the relative bias of the mis-specified ZINB

model also increases. When the proportion of structural zeros reaches 90%, the relative bias for

all of the six models is large, including the true ZIP model.

Abilities of capturing zero observation

Another important feature is whether or not Zero-inflated models are capable of predicting

the zero observations and structural zero observations. The observed zeros in each dataset,

together with the predicted zero observations and structural zero observations from different

models are displayed for different proportions of structural zeros in Table 2-5. When the

proportion of structural zeros (p) is equal to 0.1, the mean of the observed zero observations is

83. For the six models, both the Zero-inflated and Hurdle models capture the zero observations

accurately. The Hurdle models predict the zero observations exactly equal to the observed zeros,

due to the models’ attributes. Both the ZIP and ZINB models predict 84 zero

observations, only one more than the observed number. At the same time, neither

the Poisson nor Negative Binomial (NB) models capture enough zero observations. As p

increases to 0.7, both the Hurdle models and Zero-inflated models continue to predict the zero-

observations very accurately.

When only focusing on the prediction of structural zeros, given that Hurdle models only

allow one type of zero, they are unable to identify the different types of zero. Zero-inflated

models allow zero observations to come from two separate processes, allowing the

differentiation of the zeros. Given n=250, the expected number of structural zeros is 250*p.

When p=0.1, 0.3, 0.5, 0.7 and 0.9, the expected numbers of structural zeros are 25, 75, 125, 175,

and 225, respectively. Both the ZIP and ZINB model provide the same results when p is 0.1, 0.3,

and 0.5. When p is 0.1, the ZIP and ZINB overestimate the percentage of structural zeros, and

when p=0.3 and 0.5, they predict the structural zeros accurately. When p increases to 0.7 and 0.9,

the true ZIP model still provides accurate prediction, while the ZINB model underestimates the

structural zero percentage.
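The expected counts quoted above follow directly from n*p. The short Python sketch below reproduces them and, for illustration only, splits the expected zeros of a ZIP distribution into structural and sampling zeros for a single hypothetical Poisson mean (𝜃 = 1.0). That value is not taken from the study, where 𝜃 varies with the covariates.

```python
import math

n = 250
levels = [0.1, 0.3, 0.5, 0.7, 0.9]

# Expected number of structural zeros at each level: n * p
expected_structural = [round(n * p) for p in levels]

def expected_zip_zeros(n, p, theta):
    """Return (structural, sampling) expected zero counts under ZIP(p, theta)."""
    structural = n * p
    sampling = n * (1.0 - p) * math.exp(-theta)
    return structural, sampling

print(expected_structural)
structural, sampling = expected_zip_zeros(n, 0.1, theta=1.0)  # hypothetical theta
```

The observed zero count exceeds the structural count whenever the count process contributes sampling zeros, which is why a model can match total zeros and still misjudge the structural share.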

From this experiment, in the case where the true model is the Zero-inflated Poisson

regression model, we find the Zero-inflated models have powerful capabilities to predict both

zero observations and structural zero observations. Because of the models’ design, the Hurdle

models can always estimate the zero observations accurately, yet they fail to capture the existing

structural zero observations. Thus, if the research has underlying assumptions regarding the

existence of structural zeros, the Zero-inflated models should be considered over Hurdle models.

When comparing the Zero-inflated models in more detail, results indicate that the ZIP

predicts structural zeros very accurately when there exists a significant portion of structural

zeros, while when the proportion of structural zeros is comparatively small, it might overestimate

the percentage. The misspecified ZINB model tends to underestimate the structural zeros when

the percentage of structural zeros increases. Both the Poisson and Negative Binomial regression

models fail to capture abundant zero observations, and their model fit decreases as zero observations

increase. Thus, when faced with zero-inflation, the modified Poisson regressions should be

considered.

Pseudo-Population Hurdle Poisson Model

Next, we consider the case when the pseudo-population is the Hurdle Poisson Model.

Convergence rates under the five different levels of zero proportion are shown in Table 2-6. As

the proportion of zeros increases (up to 70% zeros), the convergence rate decreases. When the

proportion of zeros approaches 70%, the convergence rate reaches a low of 71.1%.

Model fit

The log likelihood statistics for the six different models with different proportions of zeros are

displayed in Table 2-7. The true Hurdle Poisson (HP) model has the lowest log-likelihood when

the proportion of zeros is 10% and 90%. At other proportions of zero, the Hurdle Negative

Binomial (HNB) model has the lowest log-likelihood rather than the true model.

However, when focusing on the AIC statistics (Table 2-8), the true HP model has the best

model fit, with the HNB model being the best among the remaining five misspecified models

when the proportion of zeros is relatively small. When the proportion of zeros increases to 90%, the ZIP model becomes the

second best model in terms of the model fit. In this scenario, the Poisson regression model has

the worst model fit across all the five proportions of zero. Additionally, the model fit of the true

HP model and the Zero-inflated models improves when the proportion of zeros increases.

Relative bias of E(Y|X)

The relative bias for E(Y|X) in the case when the true distribution is Hurdle Poisson is

shown in Table 2-9, with each row indicating a different scenario of zero-proportion.

Results of examining relative bias indicate that when the true model is HP, the HNB model

is the best alternative model in predicting Y when the proportion of zeros is small (less than

70%). However, when the proportion of zeros increases to 90%, neither the HP nor the HNB

have good model prediction, and the relative bias increases over 100%. In this situation, the ZIP

model has the lowest bias (only when the proportion of zeros is 90%). Because the prediction for

the HP model is so poor when zeros make up 90% of the data, the overall average of relative bias

is larger than that of the ZIP model. The results also indicate that both the Poisson and NB

models have a significant bias in predicting Y.

Abilities of capturing zero observation

In addition to model fit, we are concerned with the capability of the models to correctly

predict the zero observations. Because the datasets are generated from the Hurdle Poisson

model, which has only a single process generating zeros, we only compare the six models’

prediction of zero observations.

The observed and predicted zeros for the six different models are displayed in Table 2-

10. Both the HP and HNB exactly predict the zeros. However, when we turn to the prediction of

the Zero-inflated models, when the proportion of zero observations is 10%, 30%, 50% and 70%,

both the ZIP and ZINB models significantly overestimate the zero observations. When the

proportion of zeros is 90%, the ZIP and ZINB models predict the zero observations very accurately, the same

as the observed number. Another finding is that as the proportion of zeros increases, the

difference between the predictions of the ZIP and ZINB from the true zero observations

decreases, which indicates that when the proportion of zeros increases, the bias of the Zero-inflated

models of predicting zero observations decreases. There were no differences between the ZIP

model and ZINB model in this case regarding their capabilities of predicting zero.

As the Hurdle models can predict the zero observations accurately when the true model is

HP, and the Zero-inflated models (both ZIP and ZINB) overestimate the proportion of zeros

(especially when the proportion of zeros is small), using Zero-inflated models when the

researcher believes that the true model is a Hurdle Poisson distribution could cause significant

bias.

Pseudo-Population Zero-inflated Negative Binomial Model

Next, we analyze the results when the pseudo-population is the Zero-inflated Negative

Binomial (ZINB) Model. Different from the previous two scenarios, the Negative Binomial

distribution will have one more condition to be controlled – the dispersion rate. Thus, in this

case, both the proportion of zeros and the levels of dispersion will be controlled and analyzed.

The convergence rate for the five different levels of structural zeros and four levels of

dispersion are displayed in Table 2-11. As the proportion of structural zeros increases, the

convergence rate decreases. Additionally, as the dispersion rate increases, the convergence rate

improves, which indicates that models with larger variance (larger dispersion rate) will converge

more easily.

Model fit

The Loglikelihood and AIC statistics for the six models when the data is a ZINB distribution

with dispersion rate of 0.5 are shown in Table 2-12 and Table 2-13. In this case, for all five

proportions of structural zeros, the true ZINB regression model has the lowest log-likelihood,

with the Hurdle Negative Binomial model as the best alternative. When the true model has the

Negative Binomial formulation, both the ZIP and HP models have poor model fit, which

indicates that the Poisson distribution does not handle data with dispersion well. Thus a pre-test

for dispersion should be conducted before selecting models.

The Log-likelihood and AIC statistics are displayed for the situations when the dispersion

rate is equal to 1, 2 and 4 in Table 2-14 to Table 2-19. Similar results were found in each of the

scenarios that the true ZINB model has the best model fit over all the different cases, and the

HNB model is the best alternative among the misspecified models. What is more, over all of the

different situations, model fit improves when the percentage of structural zeros increases.

Examining the Loglikelihood and AIC statistics for various levels of dispersion rates case

by case shows that as the dispersion rate increases, the model fit improves for both the ZINB and

HNB models. Thus, we find that the ZINB and HNB models can handle data well when both

zero-inflation and dispersion are present. The average AIC of the ZINB model is

displayed in Table 2-20, showing that the model fit of ZINB improves as the proportion of zeros

increases, while the model fit decreases as the dispersion rate increases.

Relative bias of E(Y|X)

The relative bias for E(Y|X) in the case when the true distribution is the Zero-inflated

Negative Binomial distribution with dispersion rate=0.5 is shown in Table 2-21. Each row in the

table indicates a different scenario of zero-proportion.

When the true model is ZINB, the ZIP model is the best alternative model in predicting

Y, which is different from the results for AIC statistics. When the percentage of structural zeros

is 10%, the true ZINB model has the best prediction, yet when the zero proportion increases to

30%, 50% and 90%, the misspecified ZIP model has better prediction than the ZINB model.

As the proportion of zeros increases, the relative bias of the true ZINB model increases.

The results indicate that the Hurdle models have a significant bias in this case: the average of the

relative bias of both the HP and HNB models is substantially above 100%.

The relative biases when the true model is ZINB and the dispersion rate is 1, 2 and 4 are

shown in Table 2-22 to Table 2-24. Similar to when the dispersion rate is 0.5, the ZINB model

has the best prediction of Y when the percentage of zeros is relatively small, while when the zero

proportion increases to 90%, the misspecified ZIP model has better prediction than the true

model. Using results from all dispersion levels, we find that the ZINB model tends to have a

relatively better prediction as the dispersion rate increases. What is more, across all scenarios,

results show that the Hurdle models have poor prediction of Y when the true model is ZINB.

The average relative bias of the ZINB model at different combinations of zero proportion

and dispersion rate is shown in Table 2-25. As the dispersion rate increases, the prediction of

the true ZINB model improves, especially when the proportion of zeros is smaller (10%, 30%,

50% and 70%). However, when the proportion of zeros is 90%, the ZINB model has good

prediction when D=0.5, and as D increases, the prediction gets worse.

Abilities of capturing zero observation

The predicted zero observations and structural zero observations from each of the six models

given different proportions of structural zero observations and dispersion rates when the true

model is ZINB are shown in Tables 2-26 to 2-29. When the dispersion rate is 0.5 and the

percentage of structural zeros is 0.1, the observed zero observations have a mean of 120.

Comparing the six models, both the Zero-inflated and Hurdle models capture the zero

observations accurately. However, both the Poisson regression and the Negative Binomial

regression models underestimate zero observations. As the percentage of zeros increases to 0.9,

the HNB and ZINB models continue to predict zero observations accurately, while the ZIP

model overestimates zero observations when the percentage of zeros is 10%, 30% and 50%. This

may be related to the failure to account for the effects of dispersion.

In addition to predicting the number of zero observations, the capability of the prediction of

structural zeros is also of interest. When p=0.1, 0.3, 0.5, 0.7, and 0.9, the expected number of

structural zeros is 25, 75, 125, 175, and 225, respectively (given n=250). When the dispersion

rate equals 0.5, the true ZINB model was likely to overestimate structural zeros when the

proportion of zeros is 10% and was more likely to underestimate the structural zeros when the

proportion of zeros increases.

When the dispersion rate equals 1, the true ZINB model has better prediction of structural

zeros than the previous case, yet it still underestimates structural zeros when the proportion of

zeros increases to 70% and 90%, and overestimates structural zeros when the proportion of zeros

is 10%. Again, the misspecified ZIP model will overestimate the proportion of structural zeros.

When the dispersion rate increases to 2 and 4, results are similar. The true ZINB model is

more likely to overestimate the structural zeros when the proportion of zeros is small and is more

likely to underestimate the structural zeros when the proportion of zeros increases, but the

prediction bias is relatively small. The misspecified ZIP model overestimates the proportion of

structural zeros significantly across all the different proportions of zeros.

When the true model is ZINB, regardless of dispersion rate, the results indicate that both the Zero-

inflated and Hurdle models predict zero observations well. Regarding the prediction

of structural zeros, the true ZINB model tends to slightly overestimate structural zeros

when the percentage of zeros is small, and underestimate structural zeros when the percentage of

structural zeros gets large. However, the ZIP model is likely to significantly overestimate the

structural zeros, possibly because the ZIP model cannot capture the effect of

dispersion and thus attributes all the dispersion to the possible existence of many zeros. What is more,

the prediction error gets smaller as the dispersion rate gets larger. Thus in the real research, it is

important to test the data dispersion before making decisions on which model to employ, since

using the ZIP model with when the data has large variance will cause significant biases in

predicting structural zero.

Pseudo-Population Hurdle Negative Binomial Model

Finally, we analyze the results when the pseudo-population is the Hurdle Negative
Binomial (HNB) Model. The convergence rates for the five different proportions of structural
zeros and four levels of dispersion are displayed in Table 2-30. Unlike the previous
situations, the model has the smallest convergence rate when the proportion of zeros is 70%,
and convergence improves when the proportion of zeros increases to 90%. As the dispersion rate
increases, the convergence rate improves, whereas the convergence rate gets worse as the
proportion of zeros increases up to 70%.

Model fit

The log-likelihood and AIC statistics when the data are Hurdle Negative Binomial with a
dispersion rate of 0.5 are shown in Tables 2-31 and 2-32. In this case, when the proportion of
zeros is 10%, 30%, 50% and 70%, the true HNB model has the best model fit. The Zero-inflated
Negative Binomial (ZINB) model is the best alternative among the misspecified models. When
the proportion of zeros increases to 90%, the misspecified ZINB model fits even better than
the true model, which indicates that the HNB model does not handle the dataset as well given a
high percentage of zeros (90%).

When the true model has the Negative Binomial formulation, both the ZIP and HP
models have poor model fit, indicating, not surprisingly, that the Poisson formulation fits
poorly when the data are over-dispersed. What is more, when the proportion of zeros is 10%,
the Negative Binomial model fits better than the Zero-inflated Poisson and Hurdle Poisson
models. As the proportion of zeros increases, the model fit of the NB model gets worse.

Similar results are found when the dispersion rate is 1, 2, and 4 (Table 2-33 to Table 2-38).
When the proportion of zeros is 10%, 30%, 50% and 70%, the true HNB model has the best
model fit. When the proportion of zeros increases to 90%, the misspecified ZINB model has
better model fit than the true model.

The AIC statistics for the true HNB model given different proportions of zeros and

dispersion rates are shown in Table 2-39. When focusing on the HNB model itself, we find that

as the dispersion rate and the proportion of zeros increase, the model fit improves.


The relative bias for E(Y|X) when the true distribution is Hurdle Negative Binomial is shown,
for the different dispersion rates, in Table 2-40 to Table 2-43.

When the true model is HNB with a dispersion rate equal to 0.5, it is not the true HNB
model that has the best prediction of Y, but the misspecified HP model when the proportion of
zeros is smaller (10%, 30% and 50%). When the zero proportion increases to 70% and 90%, it is
the misspecified ZIP model that has the smallest bias. When the proportion of zeros is
relatively small (from 10% to 50%), the true HNB model is the second best model, with relative
bias from 10% to 30%. However, when the proportion of zeros increases to 70% and 90%, the
true HNB model has very large bias. In addition, the relative bias of the true HNB model grows
as the proportion of zeros gets larger.

When the dispersion rate increases to 1, the true HNB model has the least relative bias

when the proportion of zeros is relatively small. As the proportion of zeros increases, the

misspecified HP model has the least relative bias until the proportion of zeros is 50%, and the

ZIP model has the least relative bias when the proportion of zeros increases to 70% and 90%.

Similar results are found when the dispersion rate is 2 and 4. When the proportion of
zeros is relatively small, the true HNB model has the least bias, and the HP model is the best
alternative among the remaining five misspecified models. When the proportion of zeros
increases to 50%, the HP model has the least bias rather than the true HNB model. As the
proportion of zeros increases to 90%, the ZIP model has the least relative bias, and the ZINB
model is the second best. Neither the true HNB model nor the HP model performs well regarding
relative bias when the proportion of zeros is large.

Focusing on the performance of the true HNB model alone (Table 2-44), when the
dispersion rate is 0.5 and 1, the relative bias increases as the proportion of zeros increases.
The results also indicate that the HNB model has a significant bias in predicting E(Y) when the
proportion of zeros is large.


Regarding the six models’ ability to predict zero observations when the true model is HNB,
the observed zeros in each dataset, together with the predicted zero observations from the six
models, are displayed in Tables 2-45 to 2-48 for different proportions of zero observations
and different dispersion rates. Both the Hurdle Poisson and Hurdle Negative Binomial models
capture the zeros accurately across all five proportions of zeros. At the same time, both the
Zero-inflated Poisson (ZIP) and Zero-inflated Negative Binomial (ZINB) models are more likely
to overestimate the zero observations, especially when the proportion of zeros is small (10%,
30%, and 50%). As the proportion of zeros increases, the prediction of the Zero-inflated models
improves. When the dispersion rate is 0.5, the ZINB model overestimates the proportion of
zeros more than the ZIP model.

When the dispersion rate increases to 1 and 2, results are similar. The Hurdle models
capture zero observations accurately, and the Zero-inflated models tend to overestimate the
zero observations when the proportion of zero observations is relatively small (10% and 30%).
When the dispersion rate increases to 4, the Hurdle models again capture the zero observations
very accurately. However, unlike in the previous situations, the Zero-inflated models do not
overestimate the zero proportions as much. Thus, comparing the zero prediction under different
dispersion rates indicates that higher rates of dispersion lead to less bias from using the
misspecified models (ZIP and ZINB).

Compare the Mis-specified Models across All the True Models

The mean AIC statistics of the four modified count data models under four different data
generating processes are shown in Table 2-49. The columns indicate the true data generating
model, and the rows indicate the model that is employed. The AIC statistics of each model under
each data generating process are averaged across all the different proportions of zeros and
dispersion rates.

Across all four model generating scenarios, it is always the true model that has the lowest

average AIC statistics. Among the true models, the ZIP model has the lowest average AIC

statistic.

When examining the misspecified models, when the true model is ZIP, the best
alternative model is ZINB. In this case, the AIC of the HP model is larger, especially when the
proportion of structural zeros is small, indicating that the HP model does not fit well when
the ZIP is the true model. When the true model is ZINB, the best alternative model is HNB, and
in this case the AIC statistics of both the ZIP and HP models are large. Similarly, when the
true model is HP, the best alternative model is HNB, and the misspecified ZIP model has the
worst model fit. When the true model is HNB, the best alternative model is ZINB, and both HP
and ZIP have significantly larger AIC statistics.

Overall, the Negative Binomial formulation has good fit based on the AIC statistics when
the data have significant variance (that is, when they are generated with the Negative Binomial
formulation). Moreover, when the true model has the Poisson formulation, the Negative Binomial
formulation still has good fit based on the AIC statistics.

The average relative bias of the four modified count data models under four different data
generating processes is shown in Table 2-50. Each column indicates the true data generating
model, and each row shows the model that is employed. The relative bias of each model under
each data generating process is averaged across all the different proportions of zeros and
dispersion rates.

Across all four model generating scenarios, it is not always the true model that has the

least bias. Overall, the ZIP model has the lowest prediction bias of Y. The results also indicate

that models with Negative Binomial formulation are observed to be highly biased. This is

because the models with Negative Binomial formulation have poor prediction when there are

many zero observations.

When considering the possible impact of employing misspecified models, the ZINB
model has the least bias when the true model is ZIP, with an average bias smaller than that of
the true ZIP model. Both the ZIP and ZINB models perform well, and better than the true model,
when the true model is HP. When the true model is ZINB or HNB, the ZIP model has the least bias.

Overall, models with the Negative Binomial formulation have large bias in predicting the
mean of Y. Among all the different data generating processes, the ZIP model performs
reasonably well in all cases.

Finally, in terms of predicting zero observations, the results are highly dependent on the
assumption about the source of zeros. If only one type of zero exists, Hurdle models will
always predict the zero observations accurately, while the Zero-inflated models tend to
overestimate zero observations. If two types of zero exist, both Hurdle models and
Zero-inflated models predict zeros accurately; however, the Hurdle models fail to differentiate
the two types of zero.
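The two assumptions about the source of zeros can be written out for the Poisson case. The block below is a sketch in standard textbook notation, not necessarily the notation used elsewhere in this dissertation: the symbols \pi (probability of a structural zero), \pi_0 (probability of a zero in the hurdle model) and \lambda (Poisson mean) are introduced here for illustration.

```latex
% Zero-inflated Poisson: zeros arise from two sources
\begin{aligned}
P(Y = 0) &= \pi + (1 - \pi)\, e^{-\lambda}, \\
P(Y = y) &= (1 - \pi)\, \frac{e^{-\lambda} \lambda^{y}}{y!}, \qquad y = 1, 2, \dots
\end{aligned}

% Hurdle Poisson: all zeros arise from a single binary process
\begin{aligned}
P(Y = 0) &= \pi_0, \\
P(Y = y) &= (1 - \pi_0)\, \frac{e^{-\lambda} \lambda^{y}}{y!\,\bigl(1 - e^{-\lambda}\bigr)}, \qquad y = 1, 2, \dots
\end{aligned}
```

In the zero-inflated form the term \pi is the share of structural zeros and (1 - \pi) e^{-\lambda} is the share of sampling zeros. The hurdle form has a single zero probability, which is why it cannot separate the two types.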

Conclusion

This research was conducted to compare the performance of the most commonly used

count data models given different data-generating processes, proportion of zeros, and data

dispersion. Although Poisson regression techniques are the most popular when dealing with the

count data, the assumption of “mean equals variance” of Poisson regression is violated with the

existence of many zeros. As a result, modified Poisson regression models have been designed

and employed in count data analysis with zero-inflation issues. The most commonly used

modified models include the Zero-inflated Poisson regression (ZIP) and the Hurdle Poisson

regression (HP) models. Concerning the possible issue of over-dispersion, both the Zero-inflated

and Hurdle models can also be estimated using a Negative Binomial distribution.

The comparative performance of these commonly used count data models has rarely been

examined using simulation studies in previous research. Additionally, there is even less research

that compares these models while allowing different variations of the proportion of zeros and

dispersion of the data.

This research addresses the knowledge gap by comparing the performance of the six

commonly used count data models using simulation and quantifying the models’ capabilities of

predicting zero observations under different simulation conditions. These simulation conditions

included four different data generating distributions, five different proportions of zeros, and four

different dispersion rates. Special attention was given to differentiating the amounts of different

types of zeros to allow the evaluation of the proportion of correctly identified structural zeros for

the Zero-inflated Models, and test of the consequences, if any, of misspecifying the latent classes

for zeros given different levels of structural zeros.
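The simulation conditions described above can be enumerated with a short sketch. The proportions of zeros and dispersion rates are the values reported in this chapter. It is an assumption of this sketch that the dispersion rate is varied only for the Negative Binomial pseudo-populations, and the function and variable names are illustrative.

```python
from itertools import product

POPULATIONS = ["ZIP", "HP", "ZINB", "HNB"]
ZERO_PROPORTIONS = [0.1, 0.3, 0.5, 0.7, 0.9]
DISPERSION_RATES = [0.5, 1, 2, 4]

def simulation_conditions():
    """List (population, zero proportion, dispersion rate) design cells.

    Assumption: the dispersion rate applies only to the Negative Binomial
    pseudo-populations, so Poisson-based cells carry None instead.
    """
    conditions = []
    for population, proportion in product(POPULATIONS, ZERO_PROPORTIONS):
        if population in ("ZINB", "HNB"):
            for rate in DISPERSION_RATES:
                conditions.append((population, proportion, rate))
        else:
            conditions.append((population, proportion, None))
    return conditions

if __name__ == "__main__":
    cells = simulation_conditions()
    print(len(cells), "design cells")
    for cell in cells[:5]:
        print(cell)
```

Under that assumption the design has 10 Poisson-based cells and 40 Negative Binomial cells, and each of the six candidate models is fitted in every cell.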

To answer these questions, a Monte Carlo experiment was employed, and the statistics
evaluated include the log-likelihood, AIC, relative bias of the predicted mean, and predictions
of zero observations. This experiment was designed to provide practical recommendations to
researchers when choosing among the count data models.
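The link between the log-likelihood and AIC can be made concrete with a small sketch using the standard definition AIC = 2k - 2 ln L. The log-likelihood values are the 10% row of Table 2-2 (true model ZIP). The parameter counts k are an assumption of this sketch: an intercept and two covariates per model component, plus one dispersion parameter for the Negative Binomial models.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood

# Mean log-likelihoods, 10% structural zeros, true model ZIP (Table 2-2).
LOGLIK = {"Poisson": -1767.7, "NB": -683.1, "HP": -514.0,
          "ZIP": -506.6, "HNB": -513.9, "ZINB": -506.5}

# Assumed parameter counts (not stated in this section).
N_PARAMS = {"Poisson": 3, "NB": 4, "HP": 6, "ZIP": 6, "HNB": 7, "ZINB": 7}

def rank_by_aic(loglik, n_params):
    """Return (model, AIC) pairs sorted from best to worst fit."""
    scores = {m: aic(loglik[m], n_params[m]) for m in loglik}
    return sorted(scores.items(), key=lambda item: item[1])

if __name__ == "__main__":
    for model, score in rank_by_aic(LOGLIK, N_PARAMS):
        print(f"{model:8s} {score:8.1f}")
```

With these assumed counts the rounded values agree with the 10% row of Table 2-3. ZINB has a marginally higher log-likelihood than ZIP, but the penalty for its extra parameter leaves the true ZIP model with the lowest AIC.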

The conclusion is discussed in two sections. First, the models’ performance under
four different pseudo-populations is discussed, together with the influence of the proportion
of zeros and dispersion in each scenario. Second, we discuss the choice among
different models based on this research.

Model Performance Given Four Pseudo-Populations

True model - Zero-inflated Poisson Model

In terms of model fit, when the true model is the Zero-inflated Poisson Regression Model
(ZIP), the true model has the best model fit. Among the remaining misspecified models, the
Zero-inflated Negative Binomial (ZINB) model is very close to the true model, followed by the
Hurdle Poisson (HP) and Hurdle Negative Binomial (HNB) models, which remain close to the true
model, especially when the percentage of zeros is higher. When the ZIP is the true model, the
Poisson and Negative Binomial regression models have very poor model fit, indicating that they
do not handle cases in which zero-inflation exists.

The true ZIP model also performs well in terms of the prediction of Y when the percentage
of zeros is relatively small. However, as the percentage of zeros increases, the misspecified
ZINB model has smaller bias than the true model. The HP and HNB models have large bias,
especially when the proportion of zeros is large.

When comparing the models’ capacity to predict zeros, the ZIP, ZINB, HP, and HNB models all
predict the observed zero observations accurately. In addition, since the Zero-inflated models
allow for different generating processes for structural and sampling zeros, both the true ZIP
model and the ZINB model accurately capture the structural zeros when the percentage of zeros
is relatively small. As the percentage of zeros increases, the ZINB model tends to
underestimate the structural zeros. Due to the assumptions of the models, Hurdle models fail to
differentiate the two types of zeros.

What is more, focusing on the true ZIP model itself, as the percentage of zeros increases,
model fit improves, yet the prediction of the average of Y gets worse. As the number of zero
observations increases, it becomes more difficult for the model to converge.

True model - Hurdle Poisson Model

When the true model is the Hurdle Poisson Regression Model (HP), the true model has the
best model fit, and the HNB model, which is very close to the true model, is the best among the
five misspecified models. Following HNB, the Zero-inflated Poisson (ZIP) and Zero-inflated
Negative Binomial (ZINB) models are close to the true model, especially when the percentage of
zeros is high. Again, in this case, neither the Poisson nor the Negative Binomial regression
model has good model fit, indicating that they do not handle data when zero-inflation exists.

Regarding the relative error rate of the prediction of Y, the true HP model is not always the
best model. When the percentage of zeros is relatively small, the Hurdle Negative Binomial
(HNB) Model has the least bias. When the percentage of zeros is extremely large (greater than
90%), the ZIP model has the least bias, and the true HP model has large bias. Because of the
large bias when the percentage of zeros is large, it is the ZIP model, not the true HP
model, that has the least average bias.

When comparing the models’ capacity for predicting zeros, we find that both the true HP
model and the HNB model predict the observed zero observations accurately. The Zero-inflated
models are likely to overestimate zeros in this case.

Overall, as the percentage of zeros increases, model fit for the true HP model improves, but
the prediction of the average of Y gets worse. In addition to the poor prediction of Y when
there are many zero observations, the model becomes more difficult to converge.

True model - Zero-inflated Negative Binomial Model

When the true model is the Zero-inflated Negative Binomial (ZINB), the true ZINB model
has the best model fit, and the HNB model is the best among the five misspecified models. The
ZIP and HP models have very poor fit across all the different proportions of zeros and
dispersion rates. Regarding the impact of the dispersion, the fit of both the true ZINB model
and the HNB model gets worse as the dispersion rate increases. Again in this case, neither the
Poisson nor the Negative Binomial regression model has good fit.

Regarding the relative error rate of the prediction of Y, the true ZINB model is not always
the best. When the percentage of zeros is relatively small, the true ZINB model tends to have
the least bias, and the ZIP model performs nearly as well. However, when there are relatively
more zeros, the misspecified ZIP model has the least bias. Regarding the impact of the
dispersion rate, the true model has the least average bias when the dispersion rate is 0.5 and
4, and the misspecified ZIP model has the least average bias when the dispersion rate is 1 and
2. Both the HP and HNB models have very large bias, especially when the percentage of zeros is
greater than 50%.

When comparing the models’ capacities for predicting zeros, the true ZINB model, together
with the HNB, ZIP and HP models, predicts the observed zero observations accurately.
However, the misspecified ZIP model tends to overestimate structural zeros, while the true ZINB
model slightly overestimates structural zeros when the proportion of zeros is small and
underestimates them when the proportion of zeros is large.

As the percentage of zeros increases, the model fit of the ZINB improves. As the dispersion
rate increases, the model fit gets worse. In terms of the relative bias, the model’s prediction
accuracy decreases with an increase in the proportion of zeros.

True model - Hurdle Negative Binomial Model

Finally, when the true model is Hurdle Negative Binomial (HNB), the true HNB model has
the best model fit when the percentage of zeros is relatively small. When the percentage of
zeros increases to more than 70%, the misspecified ZINB has better model fit. The ZIP and HP
models have very poor fit across all the different levels of zero proportion and dispersion
rate.

Considering the relative error rate of predicting Y, when the proportion of zeros is relatively
small, the misspecified HP has the least bias, and the true HNB model is the second best. When
the proportion of zeros increases, the ZIP model has the least bias, and both the true HNB and
HP models have very large prediction errors. Across dispersion rates, the ZIP is always the
least biased when P(zero)>50%, and the true HNB model has very high bias when P(zero)>50%.

Regarding the models’ capacity for predicting zeros, the true HNB model and the HP model

predict the observed zero observations accurately. Both the ZIP and ZINB models tend to

overestimate zero observations.

As the percentage of zeros increases, model fit improves, while the prediction of Y gets
worse. As the dispersion rate increases, model fit improves, but no clear pattern in the
prediction of Y emerges as the dispersion rate changes.

Comparison between Different Models

Poisson formulation versus Negative-Binomial formulation

To better understand the models, we will compare models with the Poisson formulation to

models with the Negative Binomial formulation under four different model distributions. The

comparison will be conducted based on both the model fit and relative bias of prediction.

In terms of the model fit, when the true model is ZIP, the best alternative model is ZINB;
however, when the true model is ZINB, the best alternative model is HNB instead of ZIP (in
this case the AIC statistics of both the ZIP and HP models are extremely large). Similarly,
when the true model is HP, the best alternative model is HNB, and the misspecified ZIP model
does not fit well. When the true model is HNB, the best alternative model is ZINB, and both HP
and ZIP have markedly poor model fit. Thus, when the true model has a Poisson formulation, its
Negative Binomial counterpart is a good alternative, yet when the true model has an NB
formulation, misspecifying the model with a Poisson formulation will cause poor model fit.

The models can also be compared based on the relative bias of E(Y|X). When the true model
is ZIP, the ZINB has the second least bias following the true model, and when the zero
proportion increases, the misspecified ZINB model has smaller relative bias than the true ZIP
model. When the true model is HP, the HNB has the least bias apart from the true model, and the
bias is relatively small, especially when the zero proportion is small. However, when the true
model is ZINB, the best alternative model is the ZIP, and when the true model is HNB, the best
alternative model is the HP. Overall, models with the Negative Binomial formulation are
observed to be highly biased when predicting Y. This is likely because models with the Negative
Binomial formulation predict poorly when there are many zero observations.

Thus, given that researchers have long been concerned about the choice between models with
Poisson and NB formulations, results from this study strongly suggest that a test of
“dispersion” is needed before determining the model. If the dataset does have “dispersion”,
misspecifying the model with a Poisson formulation will cause poor fit. On the other hand,
when the true model has the Poisson formulation, using a model with an NB formulation will not
cause severe bias except when there are many zero observations.
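As an illustration of a first screening step, the sketch below computes the sample variance-to-mean ratio of a count variable. This is a crude diagnostic chosen for illustration, not the formal dispersion test recommended above, and the function name and example data are hypothetical.

```python
import statistics

def dispersion_ratio(counts):
    """Sample variance divided by sample mean of a count variable.

    A Poisson variable has a ratio near 1; values well above 1 suggest
    over-dispersion relative to the Poisson assumption.
    """
    mean = statistics.mean(counts)
    if mean == 0:
        raise ValueError("all counts are zero; the ratio is undefined")
    return statistics.variance(counts) / mean

if __name__ == "__main__":
    example = [0, 0, 0, 1, 2, 0, 5, 0, 3, 9]   # hypothetical counts
    print(round(dispersion_ratio(example), 2))
```

Because excess zeros alone push this ratio above 1 (a ZIP variable is over-dispersed relative to a plain Poisson), a large value does not by itself separate zero-inflation from over-dispersion in the count part. A formal comparison of the Poisson and Negative Binomial versions of the chosen model is still needed.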

Another aspect of the data that can impact model choice is the proportion of zeros and
structural zeros in the data. Results showed that the percentage of zeros influences not only
the performance of each model itself, but also the comparison of the Poisson
and Negative Binomial formulations.

When the true model is ZIP, as the proportion of structural zeros increases, model fit of the
true model improves. As the proportion of structural zeros increases, the misspecified models
HP, HNB and ZINB all improve in terms of model fit. However, the difference between the AIC
statistics of the true ZIP model and the best alternative model, ZINB, increases. In other
words, when the true model is ZIP, the advantage of using the true model rather than the next
best alternative, ZINB, becomes more pronounced. Similarly, when the true model is HP, as the
proportion of zeros increases, the fit of the true model improves. The difference between the
AIC statistics of the true model and the next best alternative (HNB) increases, indicating that
the benefit of using the true model becomes more pronounced.

When the true model is ZINB, regardless of the dispersion rate, the model fit of the true
model improves as the proportion of structural zeros increases. In this case, neither the ZIP
nor the HP model has good model fit, indicating that fit is worse when employing the
misspecified Poisson formulation; however, the difference in AIC statistics between the ZIP and
ZINB models decreases as the zero proportion increases. What is more, regarding the relative
bias of prediction, the error rate of the misspecified ZIP model decreases as the proportion of
zeros increases. When the zero proportion gets large, the bias of the ZIP model is even smaller
than that of the ZINB model.

When the true model is the HNB, the model fit of the true model improves as the proportion
of zeros increases regardless of the dispersion rate. Again, in this case, both the ZIP and HP
models have poor model fit in terms of AIC statistics. Regarding the relative bias of
prediction, however, the misspecified HP model has the least bias when the proportion of zeros
is relatively small and the misspecified ZIP model has the least bias when the proportion of
zeros increases.

The proportion of zeros impacts model fit, as does the dispersion rate. When the true model is
ZINB, regardless of the proportion of zeros, the ZINB model has the best model fit when the
dispersion rate equals 0.5. As the dispersion rate increases, the model fit worsens. In terms
of model comparison, the next best alternative (HNB) also has the best model fit when the
dispersion rate is 0.5. As the dispersion rate increases, the difference between the AIC
statistics for the ZINB and HNB models increases slightly, indicating that the advantage of
using the true model becomes more pronounced. Again, across all four dispersion rates tested,
the AIC statistics of both the HP and ZIP models are quite large, which indicates a poor fit in
this case.

Finally, we also find that if the true distribution is either a Zero-inflated model or a Hurdle
model, neither the Poisson regression model nor the NB regression model has good model fit,
which indicates that a simple one-stage model cannot handle the data well when excessive zeros
exist and when the zero observations are assumed to be generated differently.

Zero-inflated models versus Hurdle models

Prior research that compared Zero-inflated models with Hurdle models has been mostly

based on the assumption of how the zero observations were generated and very little research

compared model performance, especially using a simulation method. This study is the first to

examine both the ZIP and HP models with their Negative Binomial formulations under

simulation by controlling the proportion of zeros and dispersion rate.

First, focusing on the true model itself, the ZIP model has better model fit than the HP model,
and the ZINB model has better model fit than the HNB model when the true model is applied to
data generated from its own population. In terms of relative bias, across different
proportions of zeros and dispersion rates, the ZIP model has a larger relative bias than the HP
model when the true model is applied to data generated from its own population; similarly, the
ZINB has a larger bias than the HNB.

Next, when considering the possible impact of utilizing a misspecified model, we find that
when the true model is the ZIP, the misspecified HP model has performance and relative bias
close to those of the true model, especially when the proportion of structural zeros is large.
Conversely, when the true model is the HP, the misspecified ZIP model has a smaller bias than
the true model and similar model fit. This is mainly because the HP model has a very
large bias when the proportion of zeros is very large.

When the true model is the ZINB, the misspecified HNB has model fit close to that of the
true model, yet the relative bias of the HNB model is very large. What is more, the relative
bias of the misspecified HNB model increases as the proportion of zeros and the over-dispersion
rate increase. Specifically, when the zero proportion increases to more than 50%, the relative
bias of the mean of the HNB model is more than 100% regardless of the over-dispersion rate.
When the true model is the HNB, the misspecified ZINB model has similar model fit, and the
relative bias of the misspecified ZINB model is smaller than that of the true HNB model.

Thus, based on this study, the ZIP model appears to be the better performing model in terms
of model fit when compared with the HP model. Similarly, the ZINB model appears to be the
better performing model in terms of model fit, and also has a better prediction capability than
the HNB. If the researcher has no prior assumption about the zero-generating process, the
Zero-inflated models are preferred over the Hurdle models when choosing between these two
competing model classes. In particular, attention should be given to the case when the
proportion of zeros is more than 50%: with many zero observations in the dataset, the Hurdle
models tend to have a large relative bias.

Capability of Predicting Zero Observations

In this analysis, we have paid special attention to the models’ capabilities of predicting
zeros and structural zeros (for Zero-inflated models). When the true model is the ZIP, we find
that the HP, HNB, ZIP and ZINB models predict zeros very accurately. The Hurdle models always
predict the zeros accurately because of the model design. However, in terms of the prediction
of structural zeros, both the ZIP and ZINB models predict structural zeros accurately when the
zero proportion is smaller than 50%. When the zero proportion increases beyond 50%, the ZINB
model likely underestimates the proportion of structural zeros.

When the true model is ZINB, again, the Hurdle models accurately capture the proportion of

zeros. However, in terms of the prediction of structural zeros, the ZIP model significantly

overestimates the number of structural zeros especially when the over-dispersion rate is small.

As for the ZINB model, it has good prediction of the structural zeros when the proportion of

structural zeros is smaller than 50%. When the proportion of structural zeros is larger than 50%,

the ZINB model likely underestimates the number of structural zeros.

When the true model is HP, no structural zeros exist and we only compare the capability of
predicting zeros. In this case, we find that both the ZIP and ZINB models overestimate zero
observations. Similarly, when the true model is HNB, both the ZIP and ZINB overestimate zeros;
however, when the over-dispersion rate gets as high as 4, the overestimation by the
Zero-inflated models decreases.

Thus, by comparing the four models’ capability of capturing the zero observations, we find
that Hurdle models always predict the number of zeros accurately. However, Hurdle models
cannot differentiate between different types of zeros. Zero-inflated models allow two different
types of zero, yet they may overestimate structural zeros when the proportion of zeros is
relatively small. What is more, when the true model is a Hurdle model but Zero-inflated models
are used to model the data, zeros are likely to be overestimated. As a result, we strongly
suggest that researchers first decide what assumptions to make about the process generating the
zeros (whether or not to expect two types of zeros). If the focus of the research is mainly on
predicting the proportion of zeros, the Hurdle models will provide accurate estimates; if the
focus includes the possible existence of different types of zeros, the Zero-inflated models are
preferred.
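Why a Hurdle model reproduces the observed number of zeros can be seen in a minimal sketch. It assumes an intercept-only zero part, which is a simplification of the models fitted in this study; the function names and example data are hypothetical.

```python
def fit_zero_probability(counts):
    """Maximum likelihood estimate of P(Y = 0) in an intercept-only hurdle model.

    The zero part is a Bernoulli model for zero versus positive, so its
    MLE is simply the observed share of zeros.
    """
    return sum(1 for y in counts if y == 0) / len(counts)

def predicted_zeros(counts):
    """Number of zeros implied by the fitted zero part."""
    return len(counts) * fit_zero_probability(counts)

if __name__ == "__main__":
    counts = [0, 0, 3, 1, 0, 2, 0, 0, 5, 1]   # hypothetical data
    print(counts.count(0), predicted_zeros(counts))
```

The zero part is estimated separately from the positive counts, so its fit does not depend on whether the count part is Poisson or Negative Binomial. When covariates enter the zero part through a logit link with an intercept, the fitted zero probabilities still sum to the observed number of zeros, a standard property of logistic regression estimated by maximum likelihood.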

Future Work and Limitations

As mentioned in the previous sections, sample size is an important factor influencing

simulation experiments, thus affecting the final results. In this experiment, the sample size is set

at 250. In the future, it would be interesting to explore the possible influence of sample size on

the model performance by repeating the simulations with different sample sizes. This can be

particularly important given survey research sometimes yields smaller sample sizes.

In this analysis, the number of independent variables was set to two, one binary and one
continuous following a normal distribution. However, these two covariates were set to be
independent, and no correlation was considered, which might not always hold with real data.
Allowing the two covariates to be dependent and exploring how different levels of covariate
correlation influence model performance is another area of interest.

Additionally, in this analysis, only the models’ fit and their capabilities of predicting zeros
were recorded and analyzed. Parameter recovery and confidence intervals were not analyzed.
Given that researchers are often interested in the parameters and their significance,
understanding how well the models recover the true parameters is another area for future study.


Table 2-1. Convergence rate, true model is ZIP

10% 30% 50% 70% 90%

Convergence rate 100% 99.9% 99.6% 95.4% 86.1%

simulation size=1000

Table 2-2. Mean Loglikelihood, true model is ZIP

Preferred model under each case is indicated with bold.

P (structural zero)   Poisson   NB   HP   ZIP   HNB   ZINB

10% -1767.7 -683.1 -514.0 -506.6 -513.9 -506.5

30% -3950.2 -1576.5 -463.4 -454.5 -463.4 -454.4

50% -4791.4 -2069.8 -374.9 -368.4 -374.8 -368.7

70% -4814.8 -2559.9 -258.8 -254.9 -259.5 -256.8

90% -2484.5 -1569.7 -106.9 -105.4 -110.1 -108.9

Average -3561.7 -1691.8 -343.6 -337.9 -344.3 -339.1

Table 2-3. Mean AIC, true model is ZIP

P (structural zero)   Poisson   NB   HP   ZIP   HNB   ZINB

10% 3541 1374 1040 1025 1042 1027

30% 7906 3161 939 921 941 923

50% 9589 4148 762 749 764 751

70% 9636 5128 530 522 533 528

90% 4975 3148 226 223 234 232

Average 7129.4 3391.8 699.4 688 702.8 692.2


Table 2-4. Relative Bias for E(Y|X), true model is ZIP

P (structural zero)   Poisson   NB   ZIP   ZINB   HP   HNB

10% 0.978 0.978 -0.013 -0.014 -0.125 -0.126

30% 0.976 0.974 -0.071 -0.071 -0.386 -0.386

50% 0.973 0.967 -0.170 -0.169 -0.719 -4.1E+63

70% 0.974 0.963 -0.490 -0.485 -1.543 -1E+114

90% 1.252 1.263 -1.409 -1.015 -6.122 -1.3E+147

Average 1.031 1.029 -0.431 -0.351 -1.779 -2.6E+146


Table 2-5. Observed and predicted zero observations, true model is ZIP

P (structural zero)   Obs   Poisson   NB   HP   ZIP   ZIP (structural zero)   HNB   ZINB   ZINB (structural zero)

0.1 83 68 76 83 84 27 83 84 27

0.3 121 74 99 121 121 75 121 121 75

0.5 158 85 121 158 158 125 158 158 125

0.7 195 94 129 195 195 175 195 195 171

0.9 231 122 152 231 231 224 231 231 197

Table 2-6. Convergence rate, true model is HP

10% 30% 50% 70% 90%

Convergence rate 90.2% 84.2% 75.0% 71.1% 75.4%



Table 2-7. Mean Log-likelihood, true model is HP

P (zero)   Poisson   NB   HP   ZIP   HNB   ZINB

10% -1901.1 -756.0 -515.1 -616.5 -515.0 -616.0

30% -4043.4 -874.5 -490.2 -550.5 -490.2 -550.1

50% -5074.1 -1034.5 -414.0 -448.4 -414.0 -448.2

70% -4681.3 -1396.8 -294.0 -310.0 -294.0 -310.1

90% -2293.0 -960.2 -123.6 -127.1 -135.0 -128.2

Average -3598.6 -1004.4 -367.4 -410.5 -369.6 -410.5


Table 2-8. Mean AIC, true model is HP

P (structural zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% 3808 1520 1042 1245 1044 1246

30% 8093 1757 993 1113 994 1114

50% 10154 2077 840 909 842 911

70% 9369 2802 600 632 602 634

90% 4592 1928 259 266 284 271

Average 7203.2 2016.8 746.8 833.0 753.2 835.2


Table 2-9. Relative Bias for E(Y|X), true model is HP

P (zero)   Poisson   NB   ZIP   ZINB   HP   HNB

10% 0.978 0.973 0.198 0.207 0.070 0.070

30% 0.975 0.967 0.334 0.337 0.065 0.064

50% 0.969 0.961 0.407 0.409 -0.050 -8.8E+14

70% 0.961 0.952 0.422 0.424 -0.258 -0.259

90% 1.167 1.150 0.179 0.221 -4.546 -2E+99

Average 1.010 1.001 0.308 0.320 -0.944 -1.76E+1



Table 2-10. Observed and predicted zero observations, true model is HP

P (zero)   Obs   Poisson   NB   HP   ZIP   HNB   ZINB

10% 25 64 56 25 74 25 74

30% 75 69 82 75 105 75 105

50% 124 71 117 124 141 124 141

70% 175 78 145 175 181 175 181

90% 226 116 178 226 226 226 226

Table 2-11. Convergence rate, true model is ZINB

10% 30% 50% 70% 90%

Dispersion=0.5 96.7% 94.7% 92.2% 88.3% 82.6%

Dispersion=1 98.5% 97.9% 93.7% 88.1% 82.9%

Dispersion=2 99.8% 99.4% 97.0% 92.7% 85.6%

Dispersion=4 100% 99.6% 98.6% 92.5% 86.1%


Table 2-12. Mean Log-likelihood, true model is ZINB (dispersion=0.5)

P (structural zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% -9282.9 -1974.7 -7436.0 -7436.9 -616.5 -614.5

30% -9240.1 -2836.1 -5364.9 -5364.2 -521.2 -518.2

50% -8331.4 -3408.3 -3548.4 -3547.3 -406.1 -403.0

70% -6483.6 -3157.3 -1861.8 -1861.5 -266.5 -264.5

90% -2559.1 -52550.0 -327.6 -327.7 -105.4 -102.8

Average -7179.4 -12785.3 -3707.7 -3707.5 -383.2 -380.6



Table 2-13. Mean AIC, true model is ZINB (dispersion=0.5)

P (structural zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% 18571.9 3957.4 14883.9 14885.8 1247.1 1243.1

30% 18486.2 5680.1 10741.9 10740.5 1056.5 1050.3

50% 16668.8 6824.5 7108.9 7106.5 826.2 820.1

70% 12973.2 6322.6 3735.7 3734.9 547.0 543.1

90% 5124.2 6110.1 667.2 667.5 224.7 219.7

Average 14364.8 5578.9 7427.5 7427.0 780.3 775.3


Table 2-14. Mean loglikelihood, true model is ZINB (dispersion=1)

P (structural zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% -6442.8 -1259.7 -4960.2 -4960.1 -647.1 -644.0

30% -7136.6 -2414.8 -3610.4 -3607.6 -555.4 -550.6

50% -6859.7 -3148.4 -2328.2 -2325.3 -434.1 -430.2

70% -5719.4 -2931.7 -1273.3 -1271.4 -290.8 -288.5

90% -2457.0 -1407.9 -266.2 -265.7 -115.5 -113.7

Average -5723.1 -2232.5 -2487.6 -2486.0 -408.6 -405.4


Table 2-15. Mean AIC, true model is ZINB (dispersion=1)

P (structural zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% 12891.6 2527.5 9932.4 9932.1 1308.2 1301.9

30% 14279.3 4837.7 7232.9 7227.2 1124.9 1115.2

50% 13725.4 6304.8 4668.4 4662.6 882.1 874.4

70% 11444.8 5871.6 2558.5 2554.9 595.6 590.9

90% 4920.0 2823.9 544.4 543.4 245.1 241.4

Average 11452.3 4473.1 4987.3 4984.1 831.2 824.7



Table 2-16. Mean loglikelihood, true model is ZINB (dispersion=2)

P (structural zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% -4271.3 -920.7 -2938.5 -2936.1 -641.6 -636.9

30% -5484.7 -1877.1 -2134.8 -2129.9 -556.2 -549.7

50% -6095.8 -2726.5 -1480.1 -1476.1 -439.6 -434.6

70% -5316.4 -2760.0 -818.6 -816.0 -295.8 -292.6

90% -2459.7 -1472.2 -203.6 -202.6 -117.8 -116.2

Average -4725.5 -1951.3 -1515.1 -1512.1 -410.2 -406.0



Table 2-17. Mean AIC, true model is ZINB (dispersion=2)

P (structural zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% 8548.5 1849.4 5889.0 5884.2 1297.3 1288.0

30% 10975.6 3762.3 4281.6 4271.8 1126.4 1113.3

50% 12197.7 5460.9 2972.4 2964.2 893.4 883.3

70% 10638.7 5528.0 1649.3 1643.9 605.5 599.3

90% 4925.5 2952.4 419.2 417.3 249.5 246.4

Average 9457.2 3910.6 3042.3 3036.3 834.4 826.1



Table 2-18. Mean Loglikelihood, true model is ZINB (dispersion=4)

P (structural zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% -3072.9 -776.7 -1779.1 -1774.7 -623.5 -617.5

30% -4705.2 -1644.5 -1392.9 -1386.2 -545.7 -538.1

50% -5403.6 -2495.2 -923.5 -917.9 -431.9 -425.9

70% -4997.9 -2810.9 -547.4 -544.1 -293.1 -289.9

90% -2490.1 -1620.3 -159.2 -157.9 -117.6 -116.1

Average -4133.9 -1869.5 -960.4 -956.2 -402.4 -397.5



Table 2-19. Mean AIC, true model is ZINB (dispersion=4)

P (structural zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% 6151.9 1561.5 3570.2 3561.4 1260.9 1249.1

30% 9416.4 3297.0 2797.8 2784.4 1105.5 1090.4

50% 10813.2 4998.3 1859.0 1847.9 877.8 865.9

70% 10001.7 5629.9 1106.9 1100.1 600.1 593.9

90% 4986.1 3248.6 330.5 327.9 249.3 246.3

Average 8273.9 3747.1 1932.9 1924.3 818.7 809.1


Table 2-20. Mean AIC of ZINB model, true model is ZINB

P (structural zero)   D=0.5   D=1   D=2   D=4   Average

10% 1243.1 1301.9 1288.0 1249.1 1270.5

30% 1050.4 1115.2 1113.3 1090.4 1092.3

50% 820.0 874.4 883.3 865.9 860.9

70% 543.1 590.9 599.3 593.9 581.8

90% 219.7 241.4 246.4 246.3 238.4

Average 775.3 824.8 826.1 809.1 808.8


Table 2-21. Relative Bias for E(Y|X), true model is ZINB (dispersion=0.5)

P (structural zero)   Poisson   NB   ZIP   ZINB   HP   HNB

10% 0.971 0.973 -0.118 -0.099 -0.253 -0.314

30% 0.966 0.968 -0.209 -0.258 -0.514 -0.610

50% 0.960 0.961 -0.213 -0.334 -0.870 -9.8E+42

70% 0.958 0.964 -0.803 -0.590 -14.90 -1.4E+85

90% 1.261 1.267 -0.024 0.2504 -1680.57 -2E+106

Average 1.023 1.0273 -0.274 -0.206 -339.422 -4E+102


Table 2-22. Relative Bias for E(Y|X), true model is ZINB (dispersion=1)

P (structural zero)   Poisson   NB   ZIP   ZINB   HP   HNB


10% 0.973 0.975 -0.121 -0.081 -0.234 -0.256

30% 0.970 0.970 -0.190 -0.208 -0.457 -0.574

50% 0.965 0.962 -0.214 -0.295 -0.756 -1.7E+65

70% 0.947 0.942 0.353 0.413 -0.494 -1.3E+40

90% 1.248 1.243 0.087 0.290 -1305.27 -2E+113

Average 1.021 1.0188 -0.017 0.023 -261.443 -4E+112



Table 2-23. Relative Bias for E(Y|X), true model is ZINB (dispersion=2)

P (structural zero)   Poisson   NB   ZIP   ZINB   HP   HNB

10% 0.976 0.977 -0.07 -0.054 -0.168 -0.202

30% 0.973 0.971 -0.143 -0.142 -0.425 -0.473

50% 0.970 0.966 -0.240 -0.228 -0.817 -2E+65

70% 0.967 0.961 -0.446 -0.509 -1.485 -4.8E+96

90% 1.318 1.341 -1.384 -2.092 -56228.9 -2E+14

Average 1.041 1.0436 -0.457 -0.605 -11246.3 -4E+14



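Table 2-24. Relative Bias for E(Y|X), true model is ZINB (dispersion=4)

P (structural zero)   Poisson   NB   ZIP   ZINB   HP   HNB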


10% 0.977 0.977 -0.063 -0.047 -0.168 -0.175

30% 0.973 0.971 -0.125 -0.121 -0.431 -0.437

50% 0.971 0.966 -0.267 -0.271 -0.978 -0.928

70% 0.971 0.962 -0.463 -0.463 -1.561 -4.1E+55

90% 1.279 1.294 -1.500 -1.012 -9.5E+1 -2E+132

Average 1.034 1.034 -0.484 -0.384 -1.9E+1 -4E+131


Table 2-25. Relative Bias for E(Y|X) of ZINB model, true model is ZINB

P (structural zero)   D=0.5   D=1   D=2   D=4   Average

10% -0.099 -0.081 -0.054 -0.047 -0.070

30% -0.258 -0.208 -0.142 -0.121 -0.182

50% -0.334 -0.295 -0.228 -0.271 -0.282

70% -0.590 0.413 -0.509 -0.463 -0.287

90% 0.250 0.290 -2.092 -1.018 -0.642

Average -0.206 0.023 -0.605 -0.384 -0.293

Table 2-26. Observed and predicted zero observations, true model is ZINB (dispersion=0.5)

P (structural zero)   Obs   Poisson   NB   HP   ZIP   ZIP (structural zero)   HNB   ZINB   ZINB (structural zero)

10% 120 63 107 120 122 95 120 120 33

30% 148 66 118 148 149 128 148 148 76

50% 177 73 127 177 178 161 177 177 117

70% 207 85 141 207 207 195 207 206 132

90% 235 128 169 235 235 227 235 235 129


Table 2-27. Observed and predicted zero observations, true model is ZINB (dispersion=1)

P (structural zero)   Obs   Poisson   NB   HP   ZIP   ZIP (structural zero)   HNB   ZINB   ZINB (structural zero)

10% 102 64 94 102 104 68 102 102 29

30% 134 69 107 134 135 105 134 134 75

50% 168 76 117 168 168 145 168 167 123

70% 200 87 128 200 201 186 200 200 158

90% 233 126 164 233 233 225 233 233 158


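Table 2-28. Observed and predicted zero observations, true model is ZINB (dispersion=2)

P (structural zero)   Obs   Poisson   NB   HP   ZIP   ZIP (structural zero)   HNB   ZINB   ZINB (structural zero)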

10% 93 66 86 93 95 50 93 93 28

30% 128 72 104 128 129 91 128 128 76

50% 162 80 117 162 163 135 162 162 125

70% 198 88 127 198 198 181 198 197 166

90% 232 125 159 232 232 225 232 232 174


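Table 2-29. Observed and predicted zero observations, true model is ZINB (dispersion=4)

P (structural zero)   Obs   Poisson   NB   HP   ZIP   ZIP (structural zero)   HNB   ZINB   ZINB (structural zero)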

10% 88 68 81 88 89 39 88 88 27

30% 124 73 102 124 125 85 124 124 76

50% 160 80 116 160 160 131 160 160 125

70% 196 91 126 196 196 177 196 196 167

90% 232 122 151 232 232 224 232 232 188


Table 2-30. Convergence rate, true model is HNB

10% 30% 50% 70% 90%

Dispersion=0.5 83.3% 81.9% 80.4% 80.2% 82.0%

Dispersion=1 84.2% 80.1% 78.8% 75.1% 77.0%

Dispersion=2 83.6% 79.3% 76.0% 73.5% 78.6%

Dispersion=4 87.3% 82.3% 76.5% 72.4% 77.4%


Table 2-31. Mean Loglikelihood, true model is HNB (dispersion=0.5)

P (zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% -9174.8 -891.6 -7674.1 -7736.6 -714.3 -785.4

30% -8929.8 -794.2 -5500.9 -5537.6 -642.3 -680.6

50% -8227.5 -936.9 -3601.6 -3623.7 -523.9 -544.1

70% -6222.3 -1039.8 -1853.1 -1864.1 -360.3 -369.3

90% -2624.9 -967.8 -411.8 -415.3 -154.7 -150.1


Table 2-32. Mean AIC, true model is HNB (dispersion=0.5)

P (zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% 18355.7 1791.3 15360.3 15485.2 1442.5 1584.8

30% 17865.6 1596.4 11013.8 11087.18 1298.6 1375.1

50% 16461.1 1881.9 7215.3 7259.3 1061.9 1102.3

70% 12450.6 2087.6 3718.2 3740.2 734.7 752.7

90% 5256.0 1943.7 835.6 842.5 323.5 314.2

Average 14077.8 1860.2 7628.6 7682.8 972.3 1025.8



Table 2-33. Mean Loglikelihood, true model is HNB (dispersion=1)

P (zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% -6158.3 -790.6 -4764.7 -4843.9 -687.2 -760.4

30% -6898.6 -813.7 -3540.4 -3588.4 -623.4 -664.9

50% -7084.0 -1146.0 -2426.6 -2454.1 -508.7 -531.4

70% -5458.7 -1118.2 -1278.8 -1292.1 -351.7 -362.3

90% -2345.9 -826.1 -300.8 -304.8 -150.7 -146.4

Average -5589.1 -938.9 -2462.3 -2496.6 -464.4 -493.1


Table 2-34. Mean AIC, true model is HNB (dispersion=1)

P (zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% 12322.7 1589.3 9541.4 9699.8 1388.3 1534.8

30% 13803.2 1635.3 7092.9 7188.7 1260.8 1343.8

50% 14174.0 2300.0 4865.3 4920.1 1031.5 1076.8

70% 10923.5 2244.4 2569.6 2596.2 717.5 738.6

90% 4697.9 1660.1 613.7 621.7 315.5 306.8

Average 11184.3 1885.8 4936.6 5005.3 942.7 1000.2



Table 2-35. Mean Loglikelihood, true model is HNB (dispersion=2)

P (zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% -4447.8 -884.8 -2899.8 -2988.7 -660.0 -737.3

30% -5523.5 -842.9 -2180.7 -2232.4 -604.1 -648.7

50% -5987.8 -874.9 -1476.7 -1506.7 -492.9 -518.1

70% -5188.3 -1377.1 -854.7 -869.4 -345.2 -357.3

90% -2351.1 -982.1 -227.2 -230.8 -149.1 -144.5

Average -4699.7 -992.4 -1527.8 -1565.6 -450.3 -481.2




Table 2-36. Mean AIC, true model is HNB (dispersion=2)

P (zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% 8901.6 1777.6 5811.7 5989.4 1334.1 1488.6

30% 11053.0 1693.9 4373.5 4476.7 1222.3 1311.5

50% 11981.6 1757.9 2965.4 3025.5 999.9 1050.3

70% 10382.6 2762.3 1721.5 1750.8 704.4 728.6

90% 4708.2 1972.2 466.4 473.7 312.3 303.0

Average 9405.4 1992.8 3067.7 3143.2 914.6 976.4



Table 2-37. Mean Loglikelihood, true model is HNB (dispersion=4)

P (zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% -3198.9 -770.4 -1752.3 -1847.4 -631.2 -713.2

30% -4810.3 -801.4 -1375.4 -1430.6 -579.9 -628.5

50% -5774.9 -1010.5 -1002.8 -1034.0 -479.2 -506.8

70% -4991.6 -1250.9 -564.1 -579.6 -333.1 -346.5

90% -2397.6 -940.3 -181.6 -185.4 -148.4 -143.0

Average -4234.7 -954.7 -975.2 -1015.4 -434.3 -467.6



Table 2-38. Mean AIC, true model is HNB (dispersion=4)

P (zero)   Poisson   NB   HP   ZIP   HNB   ZINB


10% 6403.9 1548.9 3516.5 3706.8 1276.5 1440.4

30% 9626.6 1610.7 2762.8 2873.2 1173.7 1271.0

50% 11555.8 2028.9 2017.6 2080.1 972.5 1027.7

70% 9989.7 2509.9 1140.2 1171.2 680.3 707.1

90% 4801.3 1888.7 375.3 382.8 310.9 300.1

Average 8475.5 1917.5 1962.5 2042.8 882.8 949.3



Table 2-39. Mean AIC of HNB, true model is HNB

P (zero)   D=0.5   D=1   D=2   D=4   Average

10% 1442.5 1388.3 1334.1 1276.5 1360.3

30% 1298.6 1260.8 1222.3 1173.7 1238.8

50% 1061.9 1031.5 999.9 972.4 1016.4

70% 734.7 717.5 704.4 680.3 709.2

90% 323.5 315.5 312.3 310.9 315.6

Average 972.3 942.7 914.6 882.8 928.1

Table 2-40. Relative Bias for E(Y|X), true model is HNB (dispersion=0.5)

P (zero)   Poisson   NB   ZIP   ZINB   HP   HNB


10% 0.967 0.963 0.196 0.478 0.031 -0.100

30% 0.960 0.956 0.270 0.493 0.007 -0.173

50% 0.951 0.947 0.321 0.506 -0.094 -0.316

70% 0.934 0.933 0.295 0.458 -0.493 -3.E+53

90% 1.261 1.267 -0.024 0.250 -168.57 -2E+106

Average 1.014 1.014 0.212 0.437 -33.82 -4E+105


Table 2-41. Relative Bias for E(Y|X), true model is HNB (dispersion=1)

P (zero)   Poisson   NB   ZIP   ZINB   HP   HNB


10% 0.972 0.967 0.231 0.504 0.090 0.013

30% 0.966 0.961 0.310 0.519 0.057 -0.053

50% 0.958 0.953 0.365 0.532 -0.076 -0.183

70% 0.947 0.942 0.353 0.493 -0.414 -1.2E+40

90% 1.248 1.243 0.087 0.290 -1305.27 -1.5E+113

Average 1.018 1.013 0.269 0.468 -261.12 -3.1E+112




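Table 2-42. Relative Bias for E(Y|X), true model is HNB (dispersion=2)

P (zero)   Poisson   NB   ZIP   ZINB   HP   HNB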


10% 0.975 0.970 0.230 0.484 0.097 0.065

30% 0.971 0.965 0.331 0.510 0.078 0.019

50% 0.963 0.957 0.390 0.526 -0.049 -0.095

70% 0.954 0.948 0.384 0.494 -0.359 -9.5E+49

90% 1.219 1.207 0.125 0.258 -8.849 -6.6E+120

Average 1.016 1.009 0.296 0.454 -1.816 -1.3E+120



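Table 2-43. Relative Bias for E(Y|X), true model is HNB (dispersion=4)

P (zero)   Poisson   NB   ZIP   ZINB   HP   HNB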


10% 0.977 0.971 0.217 0.435 0.086 0.075

30% 0.973 0.966 0.334 0.478 0.076 0.045

50% 0.966 0.959 0.404 0.507 -0.032 -0.053

70% 0.957 0.950 0.404 0.482 -0.313 -7.3E+51

90% 1.192 1.177 0.154 0.257 -5.982 -5E+124

Average 1.013 1.004 0.302 0.432 -1.233 -1E+124


Table 2-44. Relative Bias for E(Y|X) of HNB model, true model is HNB

P (zero)   D=0.5   D=1   D=2   D=4   Average

10% -0.100 0.0138 0.065 0.075 0.013

30% -0.173 -0.053 0.019 0.045 -0.040

50% -0.316 -0.183 -0.095 -0.05 -0.161

70% -3E+53 -1E+40 -9.5E+49 -7.3E+51 -8.9E+52

90% -2E+106 -1E+113 -6E+120 -5E+124 -1E+124

Average -4E+105 -3E+112 -1E+120 -1E+124 -2E+123


Table 2-45. Observed and predicted zero observations, true model is HNB (dispersion=0.5)

P (zero)   Obs   Poisson   NB   HP   ZIP   HNB   ZINB

10% 25 53 59 25 65 25 62

30% 75 58 89 75 99 75 96

50% 125 60 125 125 138 125 136

70% 175 71 158 175 180 175 179

90% 225 116 192 225 226 225 226

Table 2-46. Observed and predicted zero observations, true model is HNB (dispersion=1)

P (zero)   Obs   Poisson   NB   HP   ZIP   HNB   ZINB


10% 25 44 61 25 58 25 63

30% 75 47 92 75 94 75 96

50% 125 54 128 125 135 125 135

70% 175 63 165 175 179 175 179

90% 225 110 193 225 226 225 225

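Table 2-47. Observed and predicted zero observations, true model is HNB (dispersion=2)

P (zero)   Obs   Poisson   NB   HP   ZIP   HNB   ZINB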


10% 25 58 58 25 70 25 63

30% 75 61 86 75 102 75 97

50% 125 66 123 125 139 125 136

70% 175 73 148 174 180 174 179

90% 225 114 180 225 226 225 226



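Table 2-48. Observed and predicted zero observations, true model is HNB (dispersion=4)

P (zero)   Obs   Poisson   NB   HP   ZIP   HNB   ZINB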


10% 25 13 22 25 24 25 24

30% 75 41 51 75 74 75 74

50% 125 73 81 125 125 125 124

70% 175 108 115 175 174 175 174

90% 225 174 186 225 225 225 225

Table 2-49. Average AIC statistics across all the models

ZIP HP ZINB HNB

ZIP 688.1 833.0 4342.9 4468.6

HP 699.4 746.8 4347.5 4398.9

ZINB 692.2 835.2 808.8 987.9

HNB 702.8 753.2 816.2 928.1

Table 2-50. Average Relative Bias across all the models

ZIP HP ZINB HNB

ZIP -0.431 0.308 -1.232 0.269

HP -1.779 -0.944 -1.172 -74.490

ZINB -0.351 0.320 -1.90E+11 0.448

HNB -2.59E+146 -1.76E+14 -4.00E+141 -2.50E+123


CHAPTER 3

A TRIPLE HURDLE COUNT DATA MODEL OF MARKET PARTICIPATION AND

CONSUMPTION

Background

In empirical economics, there has long been interest in modeling consumers’ behaviors,

in particular consumers’ preferences and purchases, and analyzing and predicting market

structure. When analyzing consumption behavior on the individual level, researchers frequently

find themselves working with count data, especially when collecting primary data using survey

instruments. Count data can be found when measuring consumption frequency or intensity

during a certain time period. Data on the frequency of purchase of food products in a given

period may pose unique challenges, as many observations may be recorded as zero-consumption.

For example, if consumers are asked “How often did you consume blueberries last

month?”, many respondents may answer that they did not consume any in the past

month (hence a zero observation). Data with excess zero observations are said to exhibit

zero-inflation.

Though there are models to deal with zero-inflation, consumption data raise an

additional issue: zero observations may arise for different reasons. One reason for a

zero observation is that some individuals have a non-positive desire for the product. For some

permanent reason, these individuals will not be consumers of the product (e.g., they are allergic).

However, a different reason for a zero observation is that some individuals have a positive desire

for the product, but they do not consume for some temporary reason (e.g., they may not be able to

afford the good at the given price). In this case, zero consumption is the corner solution for the

individual’s utility-maximizing decision. Similarly, some individuals might have a positive

desire to consume the product, but not during the recorded period (e.g., the past month) due to

infrequent or seasonal consumption, a common issue with fresh fruits and vegetables. Individuals


who show zero consumption in the first case are non-consumers, and these zero observations are

structural zeros. The individuals who have positive desire to consume, but were observed

consuming zero units because of the second and third reasons are potential consumers, and the

corresponding zero observations are sampling zeros. The particular interpretations given to these

zero consumption observations can have a crucial bearing on the estimation techniques, and the

interpretation of market segmentation.

In survey research, both non-consumers (those who do not have a positive participation

desire) and potential consumers (those who have a positive participation desire but choose to

consume zero units) are observed to have zero-consumption and are often treated as one group in

modeling. However, the market participation decision that separates non-consumers from potential

consumers is likely driven by a structurally different process than the subsequent consumption and

consumption intensity decisions. Thus, analyzing the different factors

influencing consumers’ participation and consumption decisions will provide researchers,

retailers, and producers with a better understanding of consumer behaviors. This can only be

achieved by modeling these decisions separately.

Existing analyses of market participation and consumption are mostly based on a

“double-hurdle” modeling approach. The double-hurdle approach assumes that the process of

generating zero-consumption is handled separately from the process of generating positive

consumption. However, it fails to distinguish between potential consumers and non-consumers,

who are all observed at zero-consumption. It is possible that marketing strategies developed on

these models to target non-consumers might exert different influences on non-consumers and

potential consumers. To address these limitations, this paper presents a “triple-hurdle” count data

model which allows us to observe participation intention in the first hurdle, and conditional on


the participation decision, consumers would further make the subsequent consumption and

consumption intensity decisions (allowing for positive participation but zero consumption

decisions). This model classifies consumers in the market into three types (non-consumers,

potential consumers, and consumers) and explores the structurally

different factors explaining the three groups’ market participation, consumption, and

consumption intensity decisions in sequence.

The Econometric Modeling of Count Data for Consumption Behavior

When dealing with the problem of “excess zeros”, a variety of statistical techniques have

been proposed and applied in the economics literature. One of the most widely used is the Tobit

model (Tobin, 1958). It was developed to account for the limited capacity of simple linear

regression in the presence of a preponderance of zero observations. However, the Tobit model

assumes zeros represent censored values of an underlying normally distributed latent variable

that theoretically includes negative values. This results in a restrictive model that assumes all

zero observations are structural zeros resulting from the same generating process (there is no

allowance for the possibility of sampling zeros). The model is also restrictive in assuming that

the same set of factors influences both consumers’ desire and acquisition. To address these

shortcomings of the Tobit model, a number of generalizations have been

developed.

The most popular generalizations of the Tobit model are Heckman’s sample selection

model and the double-hurdle model. When modeling consumption behavior, given the different

reasons causing zero-consumption, these models assume that individuals must pass two stages

before being observed with a positive level of consumption: a participation decision and a

consumption decision. The difference between the double-hurdle and sample selection models is


in the assumption of dominance: whether the participation decision dominates the consumption

decision.

In the double-hurdle model, there is an assumption that positive consumption is observed

only when consumers have overcome both stages. The observed consumption variable is given

by y = d·y*, where d is the indicator for the consumer’s desire to participate, and y* represents

the consumer’s decision on the consumption level. In order to observe a positive

consumption level, both d and y* must be positive. In Heckman’s sample selection model, there

is an assumption that the participation decision dominates the consumption decision, which

implies that all zero observations are structural zeros, and zero consumption does not arise from

a standard corner solution. Expressed as an equation, dominance implies that

P(y* > 0 | d = 1) = 1.

One significant problem with the Tobit model and its generalizations is that they assume

the latent variable is normally distributed, and they are very sensitive to violations of the

normality assumption (Arabmazer and Schmidt, 1982). Thus, the Tobit model and its

generalizations have significant limitations when applied to the analysis of consumption

behavior.

When the dependent variable is a count, one of the most popular

regression techniques is Poisson regression. Poisson regression is commonly used in

economics to model the number of events, for example, the frequency of consumption. However,

the Poisson model fails to provide an adequate fit when there are “excess

zeros”. The Poisson model assumes mean-variance equality, which is violated

when excess zeros pull the mean towards zero. A number of modified Poisson regression


models have been developed to account for excess zeros, the most popular of which are zero-

inflated/modified Poisson models and Hurdle count models.
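The overdispersion induced by excess zeros can be checked directly. The sketch below uses the standard moments of a zero-inflated Poisson variable with structural-zero probability π and Poisson mean λ, namely E(Y) = (1 − π)λ and Var(Y) = (1 − π)λ(1 + πλ). The function name `zip_mean_var` and its argument names are hypothetical and introduced only for illustration.

```python
def zip_mean_var(pi_zero, lam):
    """Mean and variance of a zero-inflated Poisson variable.

    pi_zero: probability of a structural zero (illustrative name).
    lam:     mean of the Poisson component.
    """
    mean = (1.0 - pi_zero) * lam
    var = (1.0 - pi_zero) * lam * (1.0 + pi_zero * lam)
    return mean, var


if __name__ == "__main__":
    # Without zero inflation, the Poisson equality of mean and variance holds.
    print(zip_mean_var(0.0, 2.0))
    # With half of the observations forced to zero, the variance exceeds the mean.
    print(zip_mean_var(0.5, 2.0))
```

Whenever `pi_zero` lies strictly between 0 and 1, the variance exceeds the mean by the factor 1 + πλ, which is why a plain Poisson fit is inadequate for such data.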

The zero-inflated Poisson (ZIP) model was proposed by Lambert (1992). This zero-inflated

count data model assumes that zero observations come from two distinct sources:

“sampling zeros” and “structural zeros.” When applied to the analysis of consumption, the

assumption is that zero-consumption can be recorded either when the consumer is a genuine

non-participant (structural zero), or when zero consumption is the corner solution of a standard

consumer demand problem (sampling zero).
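The two sources of zeros can be made concrete with the ZIP probability mass function. The following is a minimal sketch of the standard ZIP form, in which a zero arises either as a structural zero (with probability π) or as a sampling zero from the Poisson component. The function names `zip_pmf` and `structural_share_of_zeros` are assumptions made for this illustration and do not come from the original analysis.

```python
import math


def zip_pmf(y, pi_zero, lam):
    """P(Y = y) under a zero-inflated Poisson model.

    pi_zero: probability of a structural zero (genuine non-participant).
    lam:     Poisson mean for everyone else.
    """
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # Structural zeros plus sampling zeros from the Poisson component.
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson


def structural_share_of_zeros(pi_zero, lam):
    """Fraction of all expected zeros that are structural zeros."""
    return pi_zero / zip_pmf(0, pi_zero, lam)


if __name__ == "__main__":
    print(zip_pmf(0, 0.3, 2.0))
    print(structural_share_of_zeros(0.3, 2.0))
```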

Different from the zero-inflated count data model, the hurdle count model proposed by

Mullahy (1986) assumes that all zeros are sampling zeros. When applied to the analysis of

consumption, there is an assumption that individuals need to pass two stages before being

observed with a positive level of consumption: a participation decision and a consumption

decision. Furthermore, Hurdle models assume participation dominance. Thus, all the zero

observations are assumed to be generated in the first stage (decision on whether to consume), and in

the second stage, consumption behavior is truncated at zero.
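For comparison, the hurdle structure can be sketched as a probability mass function in which every zero comes from the first stage and positive counts follow a zero-truncated Poisson distribution. This is a generic hurdle Poisson form written under that assumption; the function name `hurdle_poisson_pmf` and the parameter names are hypothetical.

```python
import math


def hurdle_poisson_pmf(y, p_zero, lam):
    """P(Y = y) under a hurdle Poisson model.

    p_zero: probability of not crossing the hurdle (all zeros arise here).
    lam:    parameter of the Poisson distribution that is truncated at zero
            for those who cross the hurdle.
    """
    if y == 0:
        return p_zero
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    # Rescale so that the positive counts carry total probability 1 - p_zero.
    return (1.0 - p_zero) * poisson / (1.0 - math.exp(-lam))


if __name__ == "__main__":
    print([round(hurdle_poisson_pmf(y, 0.4, 2.0), 4) for y in range(5)])
```

Unlike the zero-inflated form, the probability of a zero here does not depend on `lam`, which reflects the single zero-generating process.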

Shonkwiler and Shaw (1996) extended the Hurdle count data model by allowing zero

observations in both the first and second stages. Thus, in Shonkwiler and Shaw’s Double Hurdle

count data model, there are two mechanisms generating zero observations: zero observations can

either happen in the first stage by choosing not to consume, or in the second stage by choosing to

consume at zero frequency. In their research, Shonkwiler and Shaw applied the double hurdle

count data model to analyze recreation demand, and they classified people into three categories:

“user”, “potential user”, and “non-user”. They define a “user” as a person who is currently

consuming the product, a non-user as a person who has never consumed the product before, and


likely will not consume the product in the future, and a “potential-user” as a person who has

consumed the product before, but is not consuming the product in the given period. In

their research, Shonkwiler and Shaw also connected the zero-inflated count

data model to the double-hurdle count data model by laying out the probability mass functions of

both models, and they concluded that the zero-inflated count data model and the double-hurdle

count data model are essentially the same. Both allow zero observations to be generated by

two separate processes, so that zero observations can be either structural or sampling zeros.

However, although these previous studies assumed that the zero observations might be

generated by two distinct sources (non-participants and potential consumers), they fail to

differentiate between the two consumer segments. In other words, the previous research

still assumes consumers make two successive decisions on consumption intention and

consumption frequency (though consumption frequency can be chosen at zero).

This study contributes to the literature by proposing a triple hurdle count data model,

which allows us to differentiate between three different types of consumers: non-participants,

potential consumers, and consumers, and to explore the structurally different factors

explaining consumers’ market participation intention, consumption intention, and consumption

intensity decisions. The results of the triple hurdle count data model will provide more detailed

information that can be useful to better classify markets into three segments: non-participants,

potential consumers, and consumers.

Motivation

The consumption of fresh produce is influenced by many different factors, which can be

stable or unstable. For example, consumers might choose not to consume certain fruits or

vegetables because of allergies, taste preferences, or diet constraints. These factors are

considered stable factors, which cause consumers to virtually ignore that type of fresh produce in


their decision making. These consumers would be expected to have non-positive market

participation intention for the specific item of produce, and are thus considered non-consumers.

However, other consumers might be influenced by unstable reasons. One significant

unstable factor for the consumption of fresh produce is seasonality. The consumption of fresh

produce can change significantly in different seasons. This can be a result of decreased supply

and availability leading to changes in prices and/or origin of producers at different times of the

year. Even though some consumers might have positive participation intention for some fruits or

vegetables, they may still choose not to consume during the off-season because of the high price.

This is referred to as the corner solution of a standard consumption problem. These consumers

influenced by seasonality (price) would change their consumption behavior when

circumstances differ, and are thus considered potential consumers.

Taking the consumption of fresh blueberries as an example, the total observed zero-consumption

per month is significantly higher in winter and lower in summer. However, when

we differentiate the observed zero-consumption into non-consumers (never purchase) and

potential consumers (purchased before but not in the last month), the number of non-consumers

appears to be comparatively stable over the year, while the number of potential consumers

changes significantly over the year (Figure 3-1).

Although both non-consumers and potential consumers report zero-consumption, it

would be inappropriate to treat all zero observations the same for fresh produce consumption.

The decision of market participation appears to be driven by a structurally different process than

the subsequent consumption decision, and consumption intensity decision. Analyzing the

different factors influencing consumers’ participation, consumption, and consumption intensity

decisions can provide researchers, retailers, and producers with a better understanding of consumer

behaviors, and thus help them develop effective and separate promotion strategies targeting

non-participants, potential consumers, and current consumers. It can also be of use to policy makers

who might be interested in policies targeting increased consumption of certain healthy foods, like

fruits and vegetables.

Conceptual Framework

To develop the triple hurdle count data model, we begin by outlining the existing

double-hurdle approach. Previous studies have theorized that the observed zero-consumption

could be driven by two different mechanisms: non-consumers (who have non-positive market

participation desire) and potential consumers (who have non-positive consumption intention

given positive participation desire). Although these studies allow for the idea that factors

influencing market participation could be different from the factors influencing consumption

decisions, they fail to observe consumers’ actual participation desire. They are also restrictive

in assuming that the same mechanism determines consumption intention and

consumption frequency decisions, which might not always be true.

In the triple hurdle count data model, we relax the restrictions of the double-hurdle

approach, and extend the framework to differentiate three types of consumers and allow three

different mechanisms to generate the consumers’ decisions on market participation, consumption

intention and consumption intensity.

The full triple hurdle count data model specification can be represented as:

R = R (consumers’ characteristics, products’ characteristics, seasonal effect)

D = D (consumers’ characteristics, products’ characteristics, seasonal effect)

Y = Y (consumers’ characteristics, products’ characteristics, seasonal effect)

where R is a binary indicator of whether the consumer has a positive desire to participate

in the market, D is a binary indicator of whether the consumer has a positive

consumption intention in the given time period given a positive desire to participate, and Y is

a positive integer indicating consumption frequency/intensity.
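The three market segments implied by R, D, and Y can be written as a small decision rule. The sketch below is only an illustration of the conceptual framework: the function name `classify_consumer` is hypothetical, and the segment labels follow the terms used in the text.

```python
def classify_consumer(r, d, y):
    """Map the three decisions to a market segment.

    r: 1 if the individual has a positive desire to participate, else 0.
    d: 1 if the individual intends to consume in the period (given r == 1), else 0.
    y: consumption frequency; positive only when r == 1 and d == 1.
    """
    if r == 0:
        return "non-participant"
    if d == 0:
        return "potential consumer"
    if y > 0:
        return "consumer"
    raise ValueError("r == 1 and d == 1 require a positive count y")


if __name__ == "__main__":
    print(classify_consumer(0, 0, 0))
    print(classify_consumer(1, 0, 0))
    print(classify_consumer(1, 1, 3))
```

Both of the first two segments are observed with zero consumption, which is why they cannot be separated from the count alone.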

Econometric Framework

In this section, we start by proposing the triple hurdle count data model with the three

stages independent of each other, then we further allow the three stages to be correlated. Next,

we outline the estimation strategy and discuss inference and interpretation of the results. We also

discuss the double hurdle approach for the purpose of comparison.

Triple Hurdle Count Data Model with Independent Stages

The triple hurdle count data model, a mixture of Poisson regression models, is an

extension of the hurdle count data model proposed by Mullahy (1984). Mullahy’s model (hurdle

count data model) included a market participation stage before the consumption stage. In the

triple hurdle model, instead, there are three stages identified: market participation, consumption

intention, and consumption intensity. Thus, the triple hurdle count data model involves three

latent equations to indicate the three stages in succession, with the first two equations having

binary outcomes indicating participation and consumption intention, and the third equation

having positive count outcome indicating consumption intensity. This splits the observations into

three regimes (non-participants, potential consumers, and consumers) that relate to potentially

three different sets of explanatory variables. Figure 3-2 is a diagram of the data generating

process involved in the triple hurdle count data model.

The model specification for the triple hurdle count data model is given in Equation 3-1 to Equation 3-3:

Market participation stage

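\[
\Pr(R^{*} = r) = \frac{\exp(-\theta_1)\,\theta_1^{r}}{r!}, \qquad r = 0, 1, 2, 3, \ldots
\]

\[
R = \begin{cases} 0 & \text{if } R^{*} \le 0 \\ 1 & \text{if } R^{*} > 0 \end{cases} \tag{3-1}
\]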

Where R denotes the binary indicator of whether to participate or not (with R=0 for non-

participants, and R=1 for participants). R is related to a latent variable R* via the mapping: R=1

for R*>0 and R=0 for R*≤0. The latent variable R* represents the propensity for market participation; specifically, we adopt the Poisson distribution for R* (see footnote 2). 𝜃1 is the parameter of the Poisson distribution, which can be parameterized as 𝜃1 = exp(𝑥′𝛽), where x is a vector of covariates and 𝛽 is a vector of unknown coefficients.

Consumption intention stage

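\[
\Pr(D^{*} = d) = \frac{\exp(-\theta_2)\,\theta_2^{d}}{d!}, \qquad d = 0, 1, 2, 3, \ldots
\]

\[
D = \begin{cases} 0 & \text{if } D^{*} \le 0 \\ 1 & \text{if } D^{*} > 0 \end{cases} \tag{3-2}
\]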

Conditional on participation (R=1), consumers make the second decision on whether to

consume during a specific time period. Let D denote a second binary indicator of whether to

consume or not in the given period (with D=0 for non-consumption, and D=1 for positive

consumption), where D is also related to a latent variable D* via the mapping: D=1 for D*>0 and

D=0 for D*≤0. We also adopt the Poisson distribution for D*. 𝜃2 is the parameter for the Poisson

distribution of D*, which can be parameterized as 𝜃2 = exp (𝑧′𝛼), where z is a vector of

covariates that determine consumers’ second choice and 𝛼 is the corresponding unknown vector

of parameters. Furthermore, there is no requirement that x=z.

Consumption intensity stage

2 R* could instead be a continuous variable generated by another process. For example, if R* followed a normal distribution, then Pr(R*≤0) = Φ(-𝑥′𝛽), where Φ is the cumulative distribution function of the standard normal distribution. However, in order to derive the sample-selected hurdle count data model with interdependence, we employ the Poisson regression for the latent variable R*.

\[
\Pr(Y^{*} = y) = \frac{\exp(-\theta_3)\,\theta_3^{y}}{y!\,\bigl(1 - \exp(-\theta_3)\bigr)}, \qquad y = 1, 2, 3, 4, \ldots \tag{3-3}
\]

Conditional on consumption in the given period (D=1 and R=1), positive consumption

frequency is observed, and the consumption intensity is represented by a latent variable Y*

(Y*=1,2,3,…J) which is generated by a Poisson regression truncated at 0. 𝜃3 is the parameter for

the Poisson distribution of Y*. 𝜃3 could be parametrized as 𝜃3 = exp (𝑤′𝛾), where w is a vector

of covariates that determine consumers’ consumption intensity, and 𝛾 is the corresponding

unknown vector of parameters. In this stage, there is no requirement that w=z=x.

Accordingly, to observe a non-participant, it is required that R=0; to observe a potential consumer, it is required jointly that the individual is a participant (R=1) who chooses not to consume in the given period (D=0); and to observe positive consumption, we require jointly that the individual is a participant (R=1) who chooses to consume at a positive intensity (D=1 and y*>0).
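The three-stage data generating process described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the estimation code used in this study: the function names, the parameter values (1.5, 0.8, 2.0), and the convention of recording D=0 and Y=0 for non-participants are all hypothetical choices made for illustration, and the Poisson sampler is Knuth's textbook method.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def truncated_poisson_draw(rng, lam):
    """Draw from a Poisson(lam) truncated at zero, by rejection."""
    while True:
        y = poisson_draw(rng, lam)
        if y > 0:
            return y

def simulate_consumer(rng, theta1, theta2, theta3):
    """Return (R, D, Y) for one consumer; Y is 0 unless R = D = 1."""
    r = 1 if poisson_draw(rng, theta1) > 0 else 0
    if r == 0:
        return 0, 0, 0      # non-participant (D and Y set to 0 by convention)
    d = 1 if poisson_draw(rng, theta2) > 0 else 0
    if d == 0:
        return 1, 0, 0      # potential consumer
    return 1, 1, truncated_poisson_draw(rng, theta3)  # consumer

rng = random.Random(0)
draws = [simulate_consumer(rng, 1.5, 0.8, 2.0) for _ in range(20000)]
share_nonparticipant = sum(1 for r, d, y in draws if r == 0) / len(draws)
```

With these illustrative values, the simulated share of non-participants should be close to exp(-1.5), the probability implied by the first stage.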

Under the assumption that the three stages are independent, the probability of an

individual being a non-participant is:

\[
\Pr(R = 0 \mid x) = \Pr(R^{*} \le 0 \mid x) = \exp(-\theta_1) \tag{3-4}
\]

The probability of an individual being a potential-consumer is:

\[
\Pr(R = 1, D = 0 \mid x, z) = \Pr(R^{*} > 0)\,\Pr(D^{*} \le 0) = \bigl(1 - \exp(-\theta_1)\bigr)\exp(-\theta_2) \tag{3-5}
\]

And the probability of observing positive consumption intensity, y, is:

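\[
\begin{aligned}
\Pr(Y = y \mid x, z, w) &= \Pr(R^{*} > 0)\,\Pr(D^{*} > 0)\,\Pr(Y^{*} = y) \\
&= \bigl(1 - \exp(-\theta_1)\bigr)\bigl(1 - \exp(-\theta_2)\bigr)\,\frac{\exp(-\theta_3)\,\theta_3^{y}}{y!\,\bigl(1 - \exp(-\theta_3)\bigr)}, \qquad y = 1, 2, 3, \ldots
\end{aligned}
\tag{3-6}
\]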

In this way, given the independence of the three stages, the probability of observing a non-participant is exp(-𝜃1), the probability of observing a potential-consumer is (1-exp(-𝜃1))*exp(−𝜃2), and the probability of observing a positive consumption intensity is a combination of the three separate processes. Note that this specification differentiates zero observations into two different regimes coming from two different generating processes. The first process selects the individuals who have a positive participation desire, and the second process generates individuals who choose zero consumption given a positive participation desire.
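As a numerical check on the regime probabilities of the independent model, the sketch below evaluates them for one hypothetical set of Poisson parameters and confirms that they sum to one. The function name, the parameter values, and the truncation point `y_max` are assumptions made for illustration; the third stage uses the standard zero-truncated Poisson probability, which has y! in the denominator.

```python
import math

def regime_probabilities(theta1, theta2, theta3, y_max=60):
    """Return Pr(non-participant), Pr(potential consumer) and {y: Pr(Y = y)}."""
    p_participate = 1.0 - math.exp(-theta1)
    p_intend = 1.0 - math.exp(-theta2)
    p_non = math.exp(-theta1)                        # non-participant
    p_potential = p_participate * math.exp(-theta2)  # potential consumer
    p_y = {}
    for y in range(1, y_max + 1):
        # Zero-truncated Poisson probability of intensity y
        truncated = math.exp(-theta3) * theta3 ** y / (
            math.factorial(y) * (1.0 - math.exp(-theta3)))
        p_y[y] = p_participate * p_intend * truncated
    return p_non, p_potential, p_y

# Hypothetical parameter values, for illustration only
p_non, p_potential, p_y = regime_probabilities(1.5, 0.8, 2.0)
total = p_non + p_potential + sum(p_y.values())
```

Because the third stage is truncated at zero, the two zero regimes and the positive counts exhaust all outcomes, so `total` equals one up to rounding.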

Once the full set of probabilities have been specified, for any given observation, i, the

sample-selected hurdle count data model has the following likelihood function in Equation 3-7:

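\[
f(R_i, D_i, Y_i \mid \theta_1, \theta_2, \theta_3)
= \bigl[\exp(-\theta_1)\bigr]^{1[R_i = 0]}
\left( \bigl(1 - \exp(-\theta_1)\bigr)\,
\bigl[\exp(-\theta_2)\bigr]^{1[D_i = 0]}
\left[ \frac{\bigl(1 - \exp(-\theta_2)\bigr)\exp(-\theta_3)\,\theta_3^{Y_i}}{Y_i!\,\bigl(1 - \exp(-\theta_3)\bigr)} \right]^{1[D_i = 1]}
\right)^{1[R_i = 1]}
\tag{3-7}
\]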

Where 𝜃1 = exp(𝑥′𝛽), and 𝛽 are the parameters on x in the first stage, 𝜃2 = exp (𝑧′𝛼),

and 𝛼 are the parameters on z in the second stage, and 𝜃3 = exp (𝑤′𝛾) with 𝛾 being the set of

parameters on 𝑤 in the third stage.
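The contribution of one observation to the log-likelihood of the independent model can be sketched as follows. This is an illustrative implementation under assumptions: the function names, covariate vectors, and coefficient values are hypothetical, and the truncated Poisson term includes the y! normalizing constant (which does not depend on the parameters).

```python
import math

def linear_index(covariates, coefficients):
    """Inner product of a covariate vector and a coefficient vector."""
    return sum(c * b for c, b in zip(covariates, coefficients))

def log_likelihood_i(r, d, y, x, z, w, beta, alpha, gamma):
    """Log-likelihood of one observation (R=r, D=d, Y=y), independent stages."""
    theta1 = math.exp(linear_index(x, beta))
    theta2 = math.exp(linear_index(z, alpha))
    theta3 = math.exp(linear_index(w, gamma))
    if r == 0:                                    # non-participant
        return -theta1
    log_lik = math.log1p(-math.exp(-theta1))      # log Pr(R* > 0)
    if d == 0:                                    # potential consumer
        return log_lik - theta2
    log_lik += math.log1p(-math.exp(-theta2))     # log Pr(D* > 0)
    # Zero-truncated Poisson log-probability of the observed intensity y
    log_lik += (-theta3 + y * math.log(theta3) - math.lgamma(y + 1)
                - math.log1p(-math.exp(-theta3)))
    return log_lik

# Hypothetical covariates and coefficients, for illustration only
x, z, w = [1.0, 0.5], [1.0, 2.0], [1.0, 0.5, 2.0]
beta, alpha, gamma = [0.2, -0.1], [-0.3, 0.1], [0.4, 0.2, -0.1]
args = (x, z, w, beta, alpha, gamma)
prob_sum = (math.exp(log_likelihood_i(0, 0, 0, *args))
            + math.exp(log_likelihood_i(1, 0, 0, *args))
            + sum(math.exp(log_likelihood_i(1, 1, y, *args))
                  for y in range(1, 61)))
```

Summing the sample log-likelihood then amounts to adding `log_likelihood_i` over observations; `prob_sum` checks that the implied probabilities add up to one.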

Triple Hurdle Count Data Model with Interdependence

The assumption that the three stages are unrelated is restrictive; it is quite plausible that they are related. To accommodate this, we now extend the model to allow the three stages to be correlated, which requires that the latent variables (R*, D*, Y*) follow a trivariate Poisson distribution. The full observability criteria for the three types of consumers are as follows:

A consumer is a non-participant if R=0, is a potential consumer if (R=1 and D=0) and is a

positive consumer with a positive consumption level y if (R=1, D=1, and y*=y), which translates

into the following expressions for the probabilities (Equation 3-8 to Equation 3-10):

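Non-participants:

\[
\Pr(R = 0 \mid x) = \Pr(R^{*} = 0 \mid x) \tag{3-8}
\]

Potential-consumers:

\[
\Pr(D = 0, R = 1 \mid x, z) = \sum_{r=1}^{\infty} \Pr(R^{*} = r, D^{*} = 0 \mid x, z) \tag{3-9}
\]

Positive consumption:

\[
\Pr(Y = y \mid x, z, w) = \sum_{r=1}^{\infty} \sum_{d=1}^{\infty} \Pr(R^{*} = r, D^{*} = d, Y^{*} = y \mid x, z, w) \tag{3-10}
\]

where r = 0, 1, 2, …; d = 0, 1, 2, …; y = 1, 2, 3, ….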

Consider the trivariate Poisson distribution with two-way covariance structure, (𝑅∗, 𝐷∗, 𝑌∗) ~ TP(𝜃1, 𝜃2, 𝜃3, 𝜃12, 𝜃13, 𝜃23), which takes the form:

𝑅∗ = 𝑍1+𝑍12+𝑍13 (3-11)

𝐷∗ = 𝑍2+𝑍12+𝑍23 (3-12)

𝑌∗ = 𝑍3+𝑍13+𝑍23 (3-13)

where 𝑍𝑖 ~ Po(𝜃𝑖), i∈{1,2,3}, and 𝑍𝑖𝑗 ~ Po(𝜃𝑖𝑗), i,j∈{1,2,3}, i<j, are mutually independent. Then 𝑅∗ follows

marginally a Poisson distribution with parameter (𝜃1 + 𝜃12 + 𝜃13), 𝐷∗ follows marginally a

Poisson distribution with parameter (𝜃2 + 𝜃12 + 𝜃23), and

𝑌∗ follows marginally a Poisson distribution with parameter (𝜃3 + 𝜃13 + 𝜃23). (𝑅∗, 𝐷∗), (𝑅∗,

𝑌∗), and (𝐷∗, 𝑌∗) marginally follow the bivariate Poisson distributions as follows:

(𝑅∗, 𝐷∗) ~ BPoisson (𝜃1 + 𝜃13, 𝜃2 + 𝜃23, 𝜃12) with Cov(𝑅∗, 𝐷∗)= 𝜃12

(𝑅∗, 𝑌∗) ~ BPoisson (𝜃1 + 𝜃12, 𝜃3 + 𝜃23, 𝜃13) with Cov(𝑅∗, 𝑌∗) = 𝜃13

(𝐷∗, 𝑌∗) ~ BPoisson (𝜃2 + 𝜃12, 𝜃3 + 𝜃13, 𝜃23) with Cov(𝐷∗, 𝑌∗) = 𝜃23
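The common-shock construction in Equations 3-11 to 3-13 can be checked by simulation. The sketch below draws the six independent Poisson components, builds (R*, D*, Y*), and compares the sample covariance of (R*, D*) with 𝜃12. The function names and parameter values are hypothetical, and the sampler is Knuth's textbook method rather than anything specified in this study.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def trivariate_poisson_draw(rng, t1, t2, t3, t12, t13, t23):
    """Build (R*, D*, Y*) from independent Poisson components."""
    z1, z2, z3 = (poisson_draw(rng, t) for t in (t1, t2, t3))
    z12, z13, z23 = (poisson_draw(rng, t) for t in (t12, t13, t23))
    return z1 + z12 + z13, z2 + z12 + z23, z3 + z13 + z23

def sample_covariance(a, b):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (n - 1)

rng = random.Random(1)
# Hypothetical values: theta1, theta2, theta3, theta12, theta13, theta23
params = (1.0, 0.8, 1.2, 0.5, 0.3, 0.4)
draws = [trivariate_poisson_draw(rng, *params) for _ in range(40000)]
r_star, d_star, y_star = zip(*draws)
cov_rd = sample_covariance(r_star, d_star)   # should be near theta12 = 0.5
mean_r = sum(r_star) / len(r_star)           # should be near 1.0 + 0.5 + 0.3
```

The shared component Z12 is the only source of dependence between R* and D*, which is why their covariance equals 𝜃12.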

The general joint probability function of the bivariate Poisson distribution for (X, Y) ~ BP(𝜃1, 𝜃2, 𝜃0), where 𝜃0 is the covariance parameter between X and Y, is:

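\[
P(X = x, Y = y) = \exp(-\theta_1 - \theta_2 - \theta_0)\,
\frac{\theta_1^{x}}{x!}\,\frac{\theta_2^{y}}{y!}
\sum_{i=0}^{\min(x,y)} \binom{x}{i}\binom{y}{i}\, i!\,
\left(\frac{\theta_0}{\theta_1\theta_2}\right)^{i}
\tag{3-14}
\]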

(Johnson and Kotz, 1997)
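Equation 3-14 translates directly into a short function. The sketch below is illustrative only: the function name and the parameter values are assumptions, and the sums are truncated at 40, which is far in the tail for the small rates used here.

```python
import math

def bivariate_poisson_pmf(x, y, theta1, theta2, theta0):
    """Joint probability P(X = x, Y = y) for (X, Y) ~ BP(theta1, theta2, theta0)."""
    series = sum(math.comb(x, i) * math.comb(y, i) * math.factorial(i)
                 * (theta0 / (theta1 * theta2)) ** i
                 for i in range(min(x, y) + 1))
    return (math.exp(-(theta1 + theta2 + theta0))
            * theta1 ** x / math.factorial(x)
            * theta2 ** y / math.factorial(y) * series)

# Hypothetical parameter values, for illustration only
t1, t2, t0 = 0.7, 1.1, 0.4
grid_total = sum(bivariate_poisson_pmf(x, y, t1, t2, t0)
                 for x in range(41) for y in range(41))
# Marginal of X at 2, which should match a Poisson(t1 + t0) probability
marginal_x2 = sum(bivariate_poisson_pmf(2, y, t1, t2, t0) for y in range(41))
```

The marginal check reflects the property used in the text: each margin of the bivariate Poisson is itself Poisson, with the covariance parameter added to its own rate.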

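The joint probability function of the trivariate Poisson distribution with two-way covariance structure, (𝑅∗, 𝐷∗, 𝑌∗) ~ TP(𝜃1, 𝜃2, 𝜃3, 𝜃12, 𝜃13, 𝜃23), is:

\[
\begin{aligned}
\Pr(R^{*} = r, D^{*} = d, Y^{*} = y)
={}& \exp(-\theta_1 - \theta_2 - \theta_3 - \theta_{12} - \theta_{13} - \theta_{23}) \\
&\times \sum_{(z_{12}, z_{13}, z_{23}) \in C}
\frac{\theta_1^{\,r - z_{12} - z_{13}}\;\theta_2^{\,d - z_{12} - z_{23}}\;\theta_3^{\,y - z_{13} - z_{23}}\;\theta_{12}^{\,z_{12}}\;\theta_{13}^{\,z_{13}}\;\theta_{23}^{\,z_{23}}}
{(r - z_{12} - z_{13})!\,(d - z_{12} - z_{23})!\,(y - z_{13} - z_{23})!\;z_{12}!\,z_{13}!\,z_{23}!}
\end{aligned}
\tag{3-15}
\]

where the summation is over the set C ⊂ N³ of component values for which every factorial argument is non-negative:

\[
C = \bigl\{ (z_{12}, z_{13}, z_{23}) \in \mathbb{N}^{3} : z_{12} + z_{13} \le r,\; z_{12} + z_{23} \le d,\; z_{13} + z_{23} \le y \bigr\}
\tag{3-16}
\]

(Karlis and Meligkotsidou, 2005; see footnote 3)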
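The trivariate probability function and its summation set can be sketched as three nested loops over the shared components, with loop bounds that keep every factorial argument non-negative. The function name and the parameter values below are hypothetical, and the marginal check truncates the sums at 25.

```python
import math

def trivariate_poisson_pmf(r, d, y, t1, t2, t3, t12, t13, t23):
    """Pr(R* = r, D* = d, Y* = y) under the two-way covariance structure."""
    f = math.factorial
    total = 0.0
    for z12 in range(min(r, d) + 1):
        for z13 in range(min(r - z12, y) + 1):
            for z23 in range(min(d - z12, y - z13) + 1):
                # Remaining counts attributed to the stage-specific components
                a, b, c = r - z12 - z13, d - z12 - z23, y - z13 - z23
                total += (t1 ** a * t2 ** b * t3 ** c
                          * t12 ** z12 * t13 ** z13 * t23 ** z23
                          / (f(a) * f(b) * f(c) * f(z12) * f(z13) * f(z23)))
    return math.exp(-(t1 + t2 + t3 + t12 + t13 + t23)) * total

# Hypothetical values: theta1, theta2, theta3, theta12, theta13, theta23
params = (1.0, 0.8, 1.2, 0.5, 0.3, 0.4)
# Summing over d and y at r = 0 should recover Pr(R* = 0)
prob_r_zero = sum(trivariate_poisson_pmf(0, d, y, *params)
                  for d in range(26) for y in range(26))
```

With these values `prob_r_zero` should agree with exp(-(1.0 + 0.5 + 0.3)), the Poisson probability of zero under the marginal rate of R*.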

Under the assumption that the three stages are interdependent, the probability of an

individual being a non-participant is

Pr(R=0|x) = Pr(R*=0|x) = exp (- (𝜃1 + 𝜃12 + 𝜃13)) (3-17)

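The probability of an individual being a potential-consumer is (see footnote 4):

\[
\begin{aligned}
\Pr(D = 0, R = 1 \mid x, z) &= \sum_{j=1}^{\infty} \Pr(R^{*} = j, D^{*} = 0) \\
&= \exp(-(\theta_2 + \theta_{12} + \theta_{23})) - \exp(-(\theta_1 + \theta_{13}) - (\theta_2 + \theta_{23}) - \theta_{12})
\end{aligned}
\tag{3-18}
\]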

3 In the case of the trivariate Poisson distribution with two-way covariance structure, the variance-covariance matrix of (𝑅∗, 𝐷∗, 𝑌∗) is as follows:

\[
\begin{pmatrix}
\theta_1 + \theta_{12} + \theta_{13} & \theta_{12} & \theta_{13} \\
\theta_{12} & \theta_2 + \theta_{12} + \theta_{23} & \theta_{23} \\
\theta_{13} & \theta_{23} & \theta_3 + \theta_{13} + \theta_{23}
\end{pmatrix}
\]

The parameters 𝜃𝑖𝑗, i,j=1,2,3, i≠j, then have the straightforward interpretation of being the covariance between each pair of variables.

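4 The derivation is:

\[
\begin{aligned}
\Pr(D = 0, R = 1) &= \sum_{j=1}^{\infty} \Pr(R^{*} = j, D^{*} = 0) \\
&= \sum_{j=0}^{\infty} \Pr(R^{*} = j, D^{*} = 0) - \Pr(R^{*} = 0, D^{*} = 0) \\
&= \Pr(D^{*} = 0) - \Pr(R^{*} = 0, D^{*} = 0) \\
&= \exp(-(\theta_2 + \theta_{12} + \theta_{23})) - \exp(-(\theta_1 + \theta_{13}) - (\theta_2 + \theta_{23}) - \theta_{12})
\end{aligned}
\]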

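And the probability of an individual being observed with positive consumption intensity y is obtained by inclusion-exclusion, where the first term is the Poisson probability Pr(Y*=y) implied by the marginal distribution Y* ~ Po(𝜃3 + 𝜃13 + 𝜃23):

\[
\begin{aligned}
\Pr(Y = y \mid x, z, w) ={}& \sum_{j=1}^{\infty} \sum_{k=1}^{\infty} \Pr(R^{*} = j, D^{*} = k, Y^{*} = y) \\
={}& \frac{\exp(-(\theta_3 + \theta_{13} + \theta_{23}))\,(\theta_3 + \theta_{13} + \theta_{23})^{y}}{y!} \\
&- \frac{\exp(-(\theta_1 + \theta_{12}) - (\theta_3 + \theta_{23}) - \theta_{13})\,(\theta_3 + \theta_{23})^{y}}{y!} \\
&- \frac{\exp(-(\theta_2 + \theta_{12}) - (\theta_3 + \theta_{13}) - \theta_{23})\,(\theta_3 + \theta_{13})^{y}}{y!} \\
&+ \frac{\exp(-\theta_1 - \theta_2 - \theta_3 - \theta_{12} - \theta_{13} - \theta_{23})\,\theta_3^{y}}{y!}
\end{aligned}
\tag{3-19}
\]

where y = 1, 2, 3, … (see footnote 5).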

Thus, under the assumption of interdependence of the three stages, the probability of

observing a non-participant is exp (- (𝜃1 + 𝜃12 + 𝜃13)) and the probability of observing a

potential-consumer is exp(-(𝜃2 + 𝜃12 + 𝜃23))– exp(−(𝜃1 + 𝜃13) −( 𝜃2 + 𝜃23)−𝜃12).
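The closed-form probabilities under interdependence can be collected in one function. This is a hedged sketch: the function name and parameter values are hypothetical, and the probability of a positive count is written by inclusion-exclusion with the full Poisson probability Pr(Y* = y) of the marginal Y* ~ Po(𝜃3 + 𝜃13 + 𝜃23) as its first term. Setting the three covariance parameters to zero gives a simple sanity check, because the expression then factors into the product of the three stage probabilities.

```python
import math

def interdependent_probabilities(t1, t2, t3, t12, t13, t23, y):
    """Return Pr(R=0), Pr(R=1, D=0) and Pr(R*>0, D*>0, Y*=y) for y >= 1."""
    total = t1 + t2 + t3 + t12 + t13 + t23
    p_non = math.exp(-(t1 + t12 + t13))
    p_potential = (math.exp(-(t2 + t12 + t23))
                   - math.exp(-(t1 + t2 + t12 + t13 + t23)))
    lam_y = t3 + t13 + t23                       # marginal rate of Y*
    p_y = (math.exp(-lam_y) * lam_y ** y                     # Pr(Y*=y)
           - math.exp(-(total - t2)) * (t3 + t23) ** y       # Pr(Y*=y, R*=0)
           - math.exp(-(total - t1)) * (t3 + t13) ** y       # Pr(Y*=y, D*=0)
           + math.exp(-total) * t3 ** y                      # Pr(Y*=y, R*=0, D*=0)
           ) / math.factorial(y)
    return p_non, p_potential, p_y

# Hypothetical values; covariances set to zero for the sanity check
t1, t2, t3, y = 1.0, 0.8, 1.2, 3
p_non0, p_pot0, p_y0 = interdependent_probabilities(t1, t2, t3, 0.0, 0.0, 0.0, y)
independent_p_y = ((1 - math.exp(-t1)) * (1 - math.exp(-t2))
                   * math.exp(-t3) * t3 ** y / math.factorial(y))
```

In the zero-covariance case `p_y0` matches `independent_p_y`, the product of the participation, intention, and (untruncated) Poisson count probabilities.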

5 In order to derive Pr(Y=y), we employ the marginal distribution 𝑌∗ ~ Po(𝜃3 + 𝜃13 + 𝜃23) and the marginal distributions (𝑅∗, 𝑌∗) ~ BPoisson(𝜃1 + 𝜃12, 𝜃3 + 𝜃23, 𝜃13) and (𝐷∗, 𝑌∗) ~ BPoisson(𝜃2 + 𝜃12, 𝜃3 + 𝜃13, 𝜃23):

\[
\begin{aligned}
\Pr(Y = y) ={}& \sum_{j=1}^{\infty} \sum_{k=1}^{\infty} \Pr(R^{*} = j, D^{*} = k, Y^{*} = y) \\
={}& \Pr(Y^{*} = y) - \Pr(Y^{*} = y, R^{*} = 0) - \Pr(Y^{*} = y, D^{*} = 0) + \Pr(Y^{*} = y, D^{*} = 0, R^{*} = 0) \\
={}& \frac{\exp(-(\theta_3 + \theta_{13} + \theta_{23}))\,(\theta_3 + \theta_{13} + \theta_{23})^{y}}{y!}
- \frac{\exp(-(\theta_1 + \theta_{12}) - (\theta_3 + \theta_{23}) - \theta_{13})\,(\theta_3 + \theta_{23})^{y}}{y!} \\
&- \frac{\exp(-(\theta_2 + \theta_{12}) - (\theta_3 + \theta_{13}) - \theta_{23})\,(\theta_3 + \theta_{13})^{y}}{y!}
+ \frac{\exp(-\theta_1 - \theta_2 - \theta_3 - \theta_{12} - \theta_{13} - \theta_{23})\,\theta_3^{y}}{y!}
\end{aligned}
\]

For the likelihood function, the parameters are redefined as (𝜃1, 𝜃2, 𝜃3, 𝜃12, 𝜃13, 𝜃23) in the case of interdependence, where 𝜃12, 𝜃13, 𝜃23 are the covariance parameters between each pair of stages. A Wald test of 𝜃𝑖𝑗 = 0, i,j∈{1,2,3}, i<j, will be employed to test for independence between each pair of stages. The likelihood function under interdependence is given in Equation 3-20:

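\[
\begin{aligned}
f(R_i, D_i, Y_i \mid{}& \theta_1, \theta_2, \theta_3, \theta_{12}, \theta_{13}, \theta_{23})
= \bigl[\exp(-(\theta_1 + \theta_{12} + \theta_{13}))\bigr]^{1[R_i = 0]} \\
&\times \Biggl(
\bigl[\exp(-(\theta_2 + \theta_{12} + \theta_{23})) - \exp(-(\theta_1 + \theta_{13}) - (\theta_2 + \theta_{23}) - \theta_{12})\bigr]^{1[D_i = 0]} \\
&\quad \times \Biggl[
\frac{\exp(-(\theta_3 + \theta_{13} + \theta_{23}))\,(\theta_3 + \theta_{13} + \theta_{23})^{Y_i}}{Y_i!}
- \frac{\exp(-(\theta_1 + \theta_{12}) - (\theta_3 + \theta_{23}) - \theta_{13})\,(\theta_3 + \theta_{23})^{Y_i}}{Y_i!} \\
&\qquad - \frac{\exp(-(\theta_2 + \theta_{12}) - (\theta_3 + \theta_{13}) - \theta_{23})\,(\theta_3 + \theta_{13})^{Y_i}}{Y_i!}
+ \frac{\exp(-\theta_1 - \theta_2 - \theta_3 - \theta_{12} - \theta_{13} - \theta_{23})\,\theta_3^{Y_i}}{Y_i!}
\Biggr]^{1[D_i = 1]}
\Biggr)^{1[R_i = 1]}
\end{aligned}
\tag{3-20}
\]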

Where 𝜃1 = exp(𝑥′𝛽), and 𝛽 are the parameters on x in the first stage, 𝜃2 = exp (𝑧′𝛼),

and 𝛼 are the parameters on z in the second stage, and 𝜃3 = exp (𝑤′𝛾) with 𝛾 being the set of

parameters on 𝑤 in the third stage.

Furthermore, from the likelihood model, we also calculate the expected probability of

observing different levels of consumption: the probability of observing a non-consumer is

expressed in Equation 3-21; and the probability of observing a potential consumer is expressed in

Equation 3-22

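\[
\Pr(\text{Non-consumer}) = \Pr(R_i = 0 \mid x_i) = \exp(-(\theta_{1i} + \theta_{12} + \theta_{13})) \tag{3-21}
\]

\[
\begin{aligned}
\Pr(\text{Potential-consumer}) &= \Pr(D_i = 0, R_i = 1 \mid x_i, z_i) \\
&= \exp(-(\theta_{2i} + \theta_{12} + \theta_{23})) - \exp(-(\theta_{1i} + \theta_{13}) - (\theta_{2i} + \theta_{23}) - \theta_{12})
\end{aligned}
\tag{3-22}
\]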

Marginal Effects and Interpreting Results

The overall effect of a given explanatory variable is determined by several different sets of marginal effects. For example, marginal effects of an explanatory variable can be determined on the probability of being a “non-consumer”, Pr(R=0), on the probability of being a potential consumer, and on the probabilities for different levels of consumption, Pr(Y=j). Calculating marginal effects for each stage of the decision allows for comparisons between non-consumers and potential-consumers (which has been lacking from previous models).

The marginal effect of a dummy variable is calculated as the difference between the probabilities when the dummy variable equals 1 and when it equals 0. For continuous variables, the marginal effects can be found from the derivatives of the probability expressions provided for each consumer category. Note that the explanatory variables in the three stages might not be the same. Thus the explanatory variable of interest may appear in only one or two of x, z and w, or in all of them. For a continuous variable 𝑥𝑘, the marginal effect on the probability of non-participation, Pr(R=0), is given in Equation 3-23, which only relates to the explanatory variables in x:

\[
ME_{\Pr(R = 0)} = \frac{\partial \Pr(R = 0)}{\partial x_k} = \exp(-(\theta_1 + \theta_{12} + \theta_{13}))\,(-\theta_1)\,\beta_k \tag{3-23}
\]
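The analytic marginal effect in Equation 3-23 can be verified against a central finite difference. The sketch below is illustrative only: the function names, covariate values, coefficients, and covariance parameters are all hypothetical.

```python
import math

def theta_from_index(x, beta):
    """Poisson parameter theta1 = exp(x'beta)."""
    return math.exp(sum(xi * bi for xi, bi in zip(x, beta)))

def prob_nonparticipant(x, beta, theta12, theta13):
    """Pr(R = 0 | x) under interdependence."""
    return math.exp(-(theta_from_index(x, beta) + theta12 + theta13))

def marginal_effect_nonparticipant(x, beta, theta12, theta13, k):
    """Analytic derivative of Pr(R = 0 | x) with respect to x[k]."""
    theta1 = theta_from_index(x, beta)
    return math.exp(-(theta1 + theta12 + theta13)) * (-theta1) * beta[k]

def finite_difference(x, beta, theta12, theta13, k, step=1e-6):
    """Central-difference approximation of the same derivative."""
    up, down = list(x), list(x)
    up[k] += step
    down[k] -= step
    return (prob_nonparticipant(up, beta, theta12, theta13)
            - prob_nonparticipant(down, beta, theta12, theta13)) / (2 * step)

# Hypothetical covariates, coefficients and covariance parameters
x, beta = [1.0, 0.5, 2.0], [0.1, -0.3, 0.2]
analytic = marginal_effect_nonparticipant(x, beta, 0.2, 0.1, k=2)
numeric = finite_difference(x, beta, 0.2, 0.1, k=2)
```

A positive coefficient raises 𝜃1 and therefore lowers the probability of non-participation, so `analytic` is negative here.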

To derive the marginal effects on the overall probabilities for the sample-selected hurdle count data model with interdependence, we need to partition the explanatory variables and the associated coefficients, because a given variable may appear in only one or two of x, z and w:

\[
x' = (u', \tilde{x}'), \qquad \beta' = (\beta_u', \tilde{\beta}')
\]

\[
z' = (u', \tilde{z}'), \qquad \alpha' = (\alpha_u', \tilde{\alpha}')
\]

\[
w' = (u', \tilde{w}'), \qquad \gamma' = (\gamma_u', \tilde{\gamma}')
\]

where u represents the common variables that appear in all of x, z, and w, with associated coefficients 𝛽𝑢, 𝛼𝑢 and 𝛾𝑢 for the participation intention, consumption intention, and consumption frequency equations, respectively. $\tilde{x}$ denotes the distinctive variables that only appear in the participation stage, with $\tilde{\beta}$ as the associated coefficients; similarly, $\tilde{z}$ and $\tilde{w}$ denote the variables that only appear in the consumption decision stage and consumption frequency stage, with $\tilde{\alpha}$ and $\tilde{\gamma}$ as the associated coefficients, respectively.

In order to express the marginal effects for the entire model, the unique explanatory variables are stacked as $x^{*\prime} = (u', \tilde{x}', \tilde{z}', \tilde{w}')$ and the associated coefficient vectors for the three stages are expressed as $\beta^{*\prime} = (\beta_u', \tilde{\beta}', 0', 0')$, $\alpha^{*\prime} = (\alpha_u', 0', \tilde{\alpha}', 0')$, and $\gamma^{*\prime} = (\gamma_u', 0', 0', \tilde{\gamma}')$.

The marginal effect of the explanatory variable vector 𝑥∗ on the consumption probability

is in Equation 3-24, which relates to the explanatory variables in both x and z:

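\[
\begin{aligned}
ME_{\Pr(D = 0 \mid R = 1;\, x, z)} ={}& \exp(-(\theta_2 + \theta_{12} + \theta_{23}))\,(-\theta_2)\,\alpha^{*} \\
&- \exp(-(\theta_1 + \theta_2 + \theta_{12} + \theta_{23} + \theta_{13}))\,\bigl[-\theta_1 \beta^{*} - \theta_2 \alpha^{*}\bigr]
\end{aligned}
\tag{3-24}
\]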

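The marginal effect of the explanatory variables on the positive level of consumption y (y=1,2,…) is given in Equation 3-25. It is obtained by differentiating each of the four terms of Pr(Y=y), the first of which is the Poisson probability Pr(Y*=y) with rate 𝜃3 + 𝜃13 + 𝜃23:

\[
\begin{aligned}
ME_{\Pr(Y = y \mid R = 1, D = 1;\, x, z, w)}
={}& \frac{\exp(-(\theta_3 + \theta_{13} + \theta_{23}))\,\bigl[\,y\,(\theta_3 + \theta_{13} + \theta_{23})^{y-1} - (\theta_3 + \theta_{13} + \theta_{23})^{y}\bigr]\,\theta_3 \gamma^{*}}{y!} \\
&- \frac{\exp(-(\theta_1 + \theta_{12}) - (\theta_3 + \theta_{23}) - \theta_{13})\; y\,(\theta_3 + \theta_{23})^{y-1}\,\theta_3 \gamma^{*}}{y!} \\
&- \frac{(\theta_3 + \theta_{23})^{y}\,\exp(-(\theta_1 + \theta_{12}) - (\theta_3 + \theta_{23}) - \theta_{13})\,(-\theta_1 \beta^{*} - \theta_3 \gamma^{*})}{y!} \\
&- \frac{\exp(-(\theta_2 + \theta_{12}) - (\theta_3 + \theta_{13}) - \theta_{23})\; y\,(\theta_3 + \theta_{13})^{y-1}\,\theta_3 \gamma^{*}}{y!} \\
&- \frac{(\theta_3 + \theta_{13})^{y}\,\exp(-(\theta_2 + \theta_{12}) - (\theta_3 + \theta_{13}) - \theta_{23})\,(-\theta_2 \alpha^{*} - \theta_3 \gamma^{*})}{y!} \\
&+ \frac{\exp(-\theta_1 - \theta_2 - \theta_3 - \theta_{12} - \theta_{13} - \theta_{23})\; y\,\theta_3^{y-1}\,\theta_3 \gamma^{*}}{y!} \\
&+ \frac{\exp(-\theta_1 - \theta_2 - \theta_3 - \theta_{12} - \theta_{13} - \theta_{23})\,\theta_3^{y}\,(-\theta_1 \beta^{*} - \theta_2 \alpha^{*} - \theta_3 \gamma^{*})}{y!}
\end{aligned}
\tag{3-25}
\]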

The marginal effects for the triple hurdle count data model with no interdependence are

calculated as above but with 𝜃13 = 𝜃23 = 𝜃12 = 0.

The standard errors of the marginal effects could be calculated by the Delta Method or by simulated asymptotic sampling techniques. Considering the complexity of the marginal effects, the sampling technique is used in this case. To be more specific, we randomly draw θ (where θ is the vector of model parameters) from MVN(θ̂, var[θ̂]) 10,000 times, and for each draw we calculate the marginal effects based on Equation 3-23 to Equation 3-25. The empirical standard deviations of the simulated marginal effects are then used as asymptotically valid estimates of the standard errors of the marginal effects.
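The sampling procedure described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the code used for the reported results: the parameter estimates, their covariance matrix, the covariate values, and all function names are hypothetical, and the marginal effect used in the example is Equation 3-23 with the covariance parameters set to zero for brevity.

```python
import math
import random
import statistics

def cholesky(matrix):
    """Lower-triangular factor L with L L' = matrix (symmetric positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower

def simulated_standard_error(theta_hat, cov, marginal_effect,
                             n_draws=10000, seed=0):
    """Std. dev. of marginal effects over draws from MVN(theta_hat, cov)."""
    rng = random.Random(seed)
    lower = cholesky(cov)
    n = len(theta_hat)
    effects = []
    for _ in range(n_draws):
        shocks = [rng.gauss(0.0, 1.0) for _ in range(n)]
        draw = [theta_hat[i] + sum(lower[i][k] * shocks[k] for k in range(i + 1))
                for i in range(n)]
        effects.append(marginal_effect(draw))
    return statistics.stdev(effects)

def me_nonparticipation(params, x=(1.0, 0.5), k=1):
    """Equation 3-23 with theta12 = theta13 = 0 (illustrative simplification)."""
    theta1 = math.exp(sum(xi * bi for xi, bi in zip(x, params)))
    return math.exp(-theta1) * (-theta1) * params[k]

beta_hat = [0.2, -0.4]                    # hypothetical estimates
beta_cov = [[0.04, 0.01], [0.01, 0.09]]   # hypothetical covariance matrix
se_me = simulated_standard_error(beta_hat, beta_cov, me_nonparticipation)
```

For a linear function of the parameters the simulated standard error should reproduce the exact value, which gives a simple way to validate the sampler before applying it to the nonlinear marginal effects.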

Comparing Triple Hurdle Count Data Model and the Double Hurdle Models

One goal of this study is to discuss the difference in insights gained when differentiating

potential consumers from non-consumers, and employing the triple hurdle model instead of the

double-hurdle approach.

The double-hurdle alternative is similar to the model proposed by Shonkwiler and Shaw (1996), which assumes that the factors influencing the consumption frequency decision are the same as the factors influencing the consumption intention.

In Shonkwiler and Shaw’s model, the probability of observing a non-participant is:

\[
\Pr(D = 0 \mid x) = \Pr(D^{*} = 0 \mid x) = \exp(-(\theta_1 + \theta_{12})) \tag{3-26}
\]

and the probability of observing an individual who is a potential consumer is:

\[
\Pr(D = 1, Y = 0 \mid x, w) = \sum_{j=1}^{\infty} \Pr(D^{*} = j, Y^{*} = 0) = \exp(-(\theta_2 + \theta_{12})) - \exp(-\theta_1 - \theta_2 - \theta_{12}) \tag{3-27}
\]

The probability of observing positive consumption frequency is:

\[
\Pr(D = 1, Y = y \mid x, w) = \sum_{j=1}^{\infty} \Pr(D^{*} = j, Y^{*} = y)
= \frac{\exp(-\theta_2 - \theta_{12})\,(\theta_2 + \theta_{12})^{y}}{y!} - \frac{\exp(-\theta_1 - \theta_2 - \theta_{12})\,\theta_2^{y}}{y!} \tag{3-28}
\]

where 𝜃1 = exp(𝑥′𝛽), with 𝛽 being the parameters on x in the market participation stage, and 𝜃2 = exp(𝑤′𝛾), with 𝛾 being the set of parameters on 𝑤 in the consumption stage. In this case, there is an assumption that the probability of consumption intention and the probability of consumption frequency are related to the same explanatory factors (𝑤) in similar ways.

Although this double-hurdle approach is non-nested with the triple hurdle model, a generalized likelihood ratio (LR) statistic could be used, with degrees of freedom given by the number of additional parameters estimated in the more general model. Additionally, in such a non-nested situation, information-based model selection criteria such as AIC and BIC are appropriate for choosing between alternative models. These are given by AIC = −2ln(L) + 2k and BIC = −2ln(L) + (lnN)*k, where k is the total number of parameters estimated, N is the number of observations, and ln(L) is the maximized log-likelihood function. The preferred model is the one with the smallest value.
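The two criteria can be computed as follows, using their standard definitions (AIC with a penalty of 2k). The log-likelihood values, parameter counts, and sample size in the example are hypothetical placeholders, not results from this study.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 ln L + 2k."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: -2 ln L + k ln N."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Hypothetical fits: (maximized log-likelihood, number of parameters)
candidates = {"double hurdle": (-1250.0, 30), "triple hurdle": (-1200.0, 48)}
n_obs = 1000
scores = {name: (aic(ll, k), bic(ll, k, n_obs))
          for name, (ll, k) in candidates.items()}
preferred_by_aic = min(scores, key=lambda name: scores[name][0])
```

Because BIC penalizes each parameter by ln N rather than 2, it favors the smaller model more strongly than AIC once N exceeds about 7 observations.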

Variables and Data

Data Set

In this chapter, the triple hurdle count data model was fit using an online survey about

consumers’ consumption behavior and preferences for fresh blueberries. The survey was

conducted with a random panel of respondents starting in September 2010 and lasted for 12

months, with approximately 350 participants recruited on a monthly basis. The target

respondents are primary grocery shoppers in the Eastern States of the United States. Respondents

answered a series of questions on how often and why (or why not) they purchase fresh

blueberries.

Here, for modeling purposes, non-consumers and potential consumers were distinguished

using survey design. Respondents were first asked whether they had ever purchased fresh

blueberries and then asked whether they had purchased fresh blueberries in the past month. For

those respondents who had purchased in the previous month, they were further asked to indicate

how many times they purchased fresh blueberries in the past month. Purchase information was

only asked for the past month to ensure accuracy of the data as it is difficult for people to recall

purchases more than one month ago. By asking respondents whether they have purchased fresh

blueberries before and whether they had purchased fresh blueberries last month in two questions,

“non-consumers” and “potential consumers” can be differentiated according to the definition of

the three types of consumers given above.

The questionnaire consisted of four parts. The first part included questions concerning

consumers’ frequency of consuming fresh blueberries. The second part focused on the reasons

for consuming (or not consuming) fresh blueberries. The third part of the questionnaire focused

on the consumers’ awareness of health benefits of eating fresh blueberries and the last part

includes socio-demographic variables, such as gender, age, educational level, employment status,

family size, socioeconomic status, etc.

Variables

The key dependent variables are ANYPARTICIPATE, ANYCONSUME, and

PURCHASEFREQ. The three variables are derived from the following questions, respectively:

“Have you ever purchased fresh blueberries?” (where a binary “Yes/No” answer is required);

“Have you purchased fresh blueberries in the LAST MONTH?”; and “In the last month,

approximately how many times did you purchase fresh blueberries?”. For the final question,

respondents selected an answer from the categories 1 or 2 times; 3 or 4 times; 5 or 6 times; more

than 6 times; and did not purchase (though they were not shown the question if they indicated no

purchase, this was used as a consistency check). Thus, the use of these three dependent variables

corresponds to examining a three-step decision made with respect to participation and

consumption. As noted earlier, by asking the three questions in sequence, it allows identification

of non-participants, potential consumers, and consumers.

The covariates employed in the model are shown in Table 3-1, together with their means.

In addition, descriptions of each variable, and whether the variable was employed in the

participation stage (P), consumption intention stage (C), and consumption frequency stage (F)

are also indicated in this table.

The individual characteristics include gender, education level, race, age and awareness of

health benefits of blueberries. In this dataset, only 35.7% of the respondents are male, which was

expected as only primary grocery shoppers for the household completed the survey. Education

level is controlled for with a binary dummy variable indicating whether the respondents have a

four-year college degree or not (40.6% of the participants have earned at least undergraduate

degree). Consumers’ awareness of health benefits of blueberries was controlled for by using a

dummy variable, which allows us to test the effectiveness of knowledge of health benefits on

consumption decisions. In this dataset, 51.9% of participants indicated that they were aware of

specific health benefits of blueberries.

Together with individual characteristics, household characteristics are also controlled,

including the number of people in the household, whether there are children living in the

household, household income, and household food budget per week. Both household income and

household food budget per week are included based on previous research that indicates income

level works as a social class proxy for consumption participation, and food budget works more

closely influencing consumption frequency.

The last set of variables is a ranking of how important the respondent finds different

attributes of blueberries, including price and taste. Since the consumption of blueberries changes

significantly over the year, we also include seasonal dummy variables.

Results

The estimated probabilities of the different types of consumers for fresh blueberry consumption from both the double-hurdle and triple-hurdle approaches are presented in Table 3-2. The predicted probability of non-consumers of fresh blueberries from the triple-hurdle model is 21.26% (compared to the observed share of 20.06%), and the predicted probability from the double-hurdle model is 19.83%. The predicted probability of potential consumers from the triple-hurdle model is 27.03% (compared to the observed share of 26.84%), and the predicted probability from the double-hurdle approach is 23.60%. The predicted probability of consumers from the triple-hurdle model is 51.71% (compared to the observed share of 53.10%), and the predicted probability from the double-hurdle approach is 56.67%. The triple-hurdle predictions are therefore closer to the observed shares for potential consumers and consumers, while the double-hurdle prediction is slightly closer for non-consumers. In particular, the double-hurdle approach underestimates the percentage of potential consumers in the market (23.60% versus an observed 26.84%).

Summary statistics from both the Double hurdle (DH) model and the Triple hurdle (TH)

model are presented in Table 3-3. The DH model is conditional only on X and W, which forces

the assumption that the same set of explanatory factors influence consumption intention and

consumption frequency. The TH model is conditional on X, Z, and W, which allows potential

consumers to be differentiated from consumers. The likelihood ratio statistics from the models of

fresh blueberry consumption reject the DH model. Furthermore, both the AIC and BIC

information criteria suggest the superiority of the TH model over the DH. We, therefore, focus

the discussion on results from the triple hurdle count data model but include discussion on the

insights gained from using the triple hurdle approach compared to the double hurdle approach.

Regression Results of the Triple Hurdle Count Data Model

Triple-hurdle count data model estimation results are displayed in Table 3-4 together with

the estimation results from the double-hurdle approach. Coefficient estimates for factors

associated with the probability of having a positive market participation intention (Stage 1) are

displayed in Column 1; coefficient estimates for factors associated with positive consumption

intention are displayed in Column 2; and coefficient estimates for factors associated with

positive consumption levels given positive market participation intention and consumption

intention are shown in Column 3.

First, focusing on the demographic characteristics, some were found to have consistent

results from both the DH and TH models. For example, both models indicate that females are

likely to purchase fresh blueberry more frequently than males and there is no significant effect

detected on market participation and purchase intention. Another variable, age is significantly

negatively correlated with market participation intention, consumption intention, and also

consumption intensity in the triple hurdle model, which indicates that younger people are more

likely to consume fresh blueberries and also more likely to consume them more frequently, and

the results from the double hurdle model indicate similar results. Weekly food budget is

significantly positive in the triple hurdle model across all the three stages, and similar results are

found from the double hurdle model, indicating that people with higher food budgets are more

likely to try fresh blueberries, more likely to purchase blueberries during grocery shopping, and

they will purchase fresh blueberries more often.

Results differed between the two models for other demographic variables. Results from

the DH model indicate Caucasians are more likely to try fresh blueberries compared with other

races, but are likely to purchase fresh blueberries less often. The first part remains the same for the TH model, with Caucasians more likely to try fresh blueberries. However, results from the TH model find Caucasians less likely to purchase blueberries, with no relationship to consumption frequency. What is more, results from the double hurdle model indicate that Hispanics are less

likely to purchase fresh blueberries, while from the triple hurdle model, we find that Hispanics

are less likely to participate in the fresh blueberry market (thus have a higher probability of being

observed as non-consumer). The variable for Hispanic was not significantly related to

consumption intention nor consumption frequency decisions in the TH model. Education was not

significantly related to market participation or consumption frequency in the TH model, and was

negatively related to consumption intention. In the DH model, education was positively related

to participation, indicating those with more education were more likely to purchase fresh

blueberries. Income is statistically significantly positive in the double hurdle model on the

participation stage. However, no significant correlation between household income level and

fresh blueberry consumption is detected in any of the three stages in the triple hurdle model.

Considering consumers’ food habits, being vegetarian was found to be significantly and

positively correlated with fresh blueberry purchase in the double hurdle model, and was significantly positively correlated with market participation intention in the TH model, but negatively correlated with consumption intention.

When looking at the household characteristics, from the double hurdle model, families

with children would be more likely to purchase fresh blueberry and no more likely than others to

purchase blueberries more frequently. This is similar to the TH model which finds consumers

who indicate they have children living in the household are more likely to participate in the fresh

blueberry market, yet there is no relationship with consumption intention nor purchase

frequency. The difference is in the distinction that the TH shows the relationship is in the

participation decision, not the consumption decision, while both agree there is no relationship

with consumption frequency. The number of people living in the household is not significant in

either of the two stages in the double hurdle model; while in the triple hurdle model, the

estimated coefficient of number of people living in the household is significantly negatively

correlated with market participation, indicating that households with larger family size would be

less likely to participate in the market.

The estimated results indicated that consumers who were aware of the health benefits of

blueberries are significantly more likely to purchase fresh blueberries and purchase with the

higher frequency in the double hurdle model. When looking at the results from the triple hurdle

model, we found that consumers’ awareness of health benefits only significantly influences the

decision of market participation, not consumption intention. Similar to the DH model, the TH

also indicates that consumers who are aware of health benefits would purchase fresh blueberries

more frequently.

Considering blueberry characteristics, results from the double hurdle model indicate that participants who say taste is an important factor in their purchase decision are more likely to purchase fresh blueberries and more likely to purchase at a high frequency. In the triple hurdle model, however, taste is significantly correlated only with positive consumption intention and higher consumption frequency, not with market participation. As for price, the double hurdle results indicate that participants who consider price important are less likely to purchase fresh blueberries and purchase them less frequently. In the TH model, consumers who consider price important are less likely to have consumption intention for fresh blueberries and are likely to purchase fresh blueberries less frequently, yet there is no significant influence on market participation intention. Moreover, the magnitude of the price coefficient is much larger in the consumption intention stage than in the consumption frequency stage.

Finally, considering seasonality (the domestic blueberry season in the United States runs from April to late September), the results from the double hurdle model indicate that consumers are more likely to purchase fresh blueberries at a higher frequency in Summer and Fall than in Winter, with no significant difference in consumption frequency between Spring and Winter. Purchase intention is affected only in Spring, when consumers are less likely to purchase fresh blueberries. In the triple hurdle model, consumers are more likely to have positive market participation intention and purchase intention, and to purchase at a higher frequency, in Summer than in Winter. In Fall, although consumers are more likely to participate in the market and to purchase fresh blueberries, no significant effect on consumption frequency is detected. In Spring, consumers are less likely to purchase fresh blueberries than in Winter, with no significant effect on market participation intention or consumption frequency. With the TH model, it can be observed that consumers are more likely to purchase, and to purchase at a higher frequency, during the peak season. From the DH model, we see an increase in consumption frequency during peak season, but no impact on participation.

Marginal Effects of the Triple Hurdle Count Data Model

As previously mentioned, one advantage of the Triple Hurdle Count Data model is that it introduces a degree of flexibility and allows us to explore the different generating processes of the three types of consumers. This is most easily demonstrated by examining the marginal effects of different variables. Marginal effects on zero observations from the triple hurdle count data model, compared with those from the double hurdle count data model, are shown in Table 3-5. For the triple hurdle model, the overall marginal effect on Pr(y=0) is divided into two parts: the effect on non-participation, Pr(r=0), and the effect on participation with zero consumption, Pr(r=1, d=0). Table 3-6 shows the marginal effects on the unconditional probabilities of positive levels of consumption (y=1, 2, 3, 4) for the triple hurdle model versus the double hurdle model.
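The split of the zero outcome into its two parts can be written out explicitly. The following is a sketch in the notation of the text (r for market participation, d for consumption intention, y for the observed count); the covariate symbol x_j is introduced here only for illustration.

```latex
\begin{align}
\Pr(y = 0) &= \Pr(r = 0) + \Pr(r = 1,\, d = 0), \\
\frac{\partial \Pr(y = 0)}{\partial x_j}
  &= \frac{\partial \Pr(r = 0)}{\partial x_j}
   + \frac{\partial \Pr(r = 1,\, d = 0)}{\partial x_j}.
\end{align}
```

For example, if the two triple hurdle components reported for Price in Table 3-5 are simply added, the implied overall effect on the probability of zero consumption is 0.015 + 0.178 = 0.193, of which only the second component is statistically significant.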

The marginal effects (shown in Table 3-5 and Table 3-6) highlight some interesting results. One example in the case of fresh blueberry consumption is the variable indicating whether a respondent feels price is an important factor when choosing blueberries. In the Triple Hurdle Count Data model, the dominant effect of the price factor is on the probability of being a potential consumer (an increase of 0.178), and there is no significant relationship between the importance of price and the probability of being a non-participant. Thus, we conclude that when price is identified as an important factor, the likelihood of being a potential consumer is higher, while the likelihood of being a non-consumer is unaffected. This is as expected: a high price might stop someone interested in purchasing from making that purchase, whereas a non-consumer is expected to be in that category for more permanent reasons (such as allergies). A similar effect is found for the taste factor: taste influences only the probability of being a potential consumer (those who say taste is important are less likely to consume zero), and it does not influence the likelihood of being a non-consumer. By comparison, the double-hurdle model finds that both price and taste significantly influence consumers’ market participation decisions. However, the magnitudes of the estimated effects on consumption intention from the double-hurdle approach are much smaller than those from the triple-hurdle model.

Another example is consumers’ awareness of health benefits. In both the triple hurdle and double hurdle models, consumers’ awareness of health benefits is significantly correlated only with the participation decision. This implies that being aware of the benefits of blueberries influences the likelihood of trying blueberries, as well as the likelihood of consuming them more frequently, but does not affect the likelihood of being a potential consumer (once a consumer decides to participate, they are equally likely to be a potential consumer regardless of their awareness of health benefits, but if they do participate and consume, they are likely to consume more often).

As for the seasonal variables, the triple-hurdle model finds significant seasonal effects on both the market participation and consumption intention decisions, whereas the double-hurdle model indicates no significant correlation between season and the market participation decision.

In summary, the comparison shows that the triple-hurdle model provides more detailed information for inference about market participation and consumption decisions than the more restrictive alternative. The added information allows us to distinguish among factors associated primarily with market participation, those associated primarily with consumption decisions, and those associated with both decisions.

Conclusion

This study examined the factors associated with consumers’ decision making, using fresh blueberries as an example. The model presented in this paper was designed to allow three distinct decisions by consumers: market participation (whether or not to participate in the market); consumption participation (whether or not to consume during a given time period); and consumption frequency (how much to consume during the period). By distinguishing these decisions, the market can be divided into three segments of consumers: non-participants (those who have no intention to participate in the product market); potential consumers (those who have a positive market participation intention but are not willing to purchase the product during a given period); and consumers (those observed with positive consumption during a given period). These three segments arise from structurally different generating processes, and therefore a triple hurdle count data model was developed to account for each decision. The triple hurdle count data model differs from previous models by capturing the different generating processes of non-consumers and potential consumers. This approach facilitates improved inference because it accounts for the fact that market participation might be driven by a different structural process than consumption decisions. The triple hurdle count data model should be useful in many other applications where decision-making processes are non-homogeneous and the outcome is count data.
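As an illustration of the three-decision structure summarized above, the following minimal Python sketch simulates respondents passing through the three stages. It is not the estimated model: the stage probabilities (0.8 and 0.65), the rate 2.0, the independence of the stages, and the zero-truncated Poisson for frequency are all assumptions made here for illustration, and the function names are hypothetical. In particular, the estimated model reports correlations across stages (the Rho terms in Table 3-4), which this sketch ignores.

```python
import math
import random
from collections import Counter


def draw_truncated_poisson(lam, rng):
    """Draw from Poisson(lam) conditioned on being >= 1 (rejection sampling)."""
    while True:
        # Knuth's multiplication method for a single Poisson draw.
        limit, k, prod = math.exp(-lam), 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        if k >= 1:
            return k


def simulate_consumer(p_participate, p_consume, lam, rng):
    """Return (segment, count) for one simulated respondent."""
    if rng.random() >= p_participate:   # Stage 1 fails: r = 0
        return "non-participant", 0
    if rng.random() >= p_consume:       # Stage 2 fails: r = 1, d = 0
        return "potential consumer", 0
    # Stage 3: positive consumption frequency, y >= 1
    return "consumer", draw_truncated_poisson(lam, rng)


rng = random.Random(0)  # fixed seed so the run is repeatable
sample = [simulate_consumer(0.8, 0.65, 2.0, rng) for _ in range(10_000)]
shares = Counter(segment for segment, _ in sample)
for segment, n in shares.items():
    print(segment, n / len(sample))
```

By construction, the expected shares are 20% non-participants and 28% potential consumers, so the simulated zeros come from two distinct sources, in proportions loosely resembling those in Table 3-2.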

To compare the triple hurdle count data model with the models commonly used to examine participation and consumption decisions in the market, we also estimated an extended version of the double hurdle count data model, similar to that of Bellemare and Barrett (2006), which is nested in our triple hurdle count data model. The likelihood ratio test, as well as the model selection criteria (AIC and BIC), indicates that the triple hurdle count data model is statistically preferred to the double hurdle model. In addition, the model results highlight a number of differences between the two approaches.
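The model comparison can be retraced from the log-likelihoods and parameter counts reported in Table 3-3. The sketch below uses the textbook definitions AIC = -2 ln L + 2K and BIC = -2 ln L + K ln N; the 1% critical value for 21 degrees of freedom is hard-coded from standard chi-square tables rather than computed. Under these definitions the BIC values closely match Table 3-3, while the AIC values in the table appear to correspond to a penalty of K rather than 2K (the textbook values are roughly 10552 for DH and 10249 for TH). The ranking of the two models is the same either way.

```python
import math

# Values reported in Table 3-3.
N = 4038
models = {
    "DH": {"loglik": -5237.126, "k": 39},
    "TH": {"loglik": -5064.679, "k": 60},
}


def aic(loglik, k):
    """Akaike information criterion, textbook definition."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """Bayesian (Schwarz) information criterion."""
    return -2.0 * loglik + k * math.log(n)


# Likelihood ratio test of the nested DH model against the TH model.
lr_stat = 2.0 * (models["TH"]["loglik"] - models["DH"]["loglik"])
df = models["TH"]["k"] - models["DH"]["k"]

# Assumption: 1% critical value of chi-square with 21 degrees of freedom,
# taken from standard statistical tables.
CHI2_CRIT_1PCT_DF21 = 38.932
reject_dh = lr_stat > CHI2_CRIT_1PCT_DF21

for name, m in models.items():
    print(name, round(aic(m["loglik"], m["k"]), 1),
          round(bic(m["loglik"], m["k"], N), 1))
print("LR =", round(lr_stat, 3), "df =", df, "reject DH at 1%:", reject_dh)
```

The likelihood ratio statistic reproduces the 344.894 with 21 degrees of freedom shown in Table 3-3, far above the 1% critical value.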

To demonstrate the differences in these two approaches, we applied both the double

hurdle and triple hurdle models to the case of fresh blueberry consumption. The application of

the triple hurdle model to the consumption of fresh blueberries highlights a strong relationship

between consumers’ knowledge and awareness of the health benefits of blueberries and market

participation and consumption frequency, but no significant correlation with the consumption

intention. These results suggest that advertisements and claims about the health and nutrition benefits of fresh blueberries would be important if policymakers intend to promote fresh produce consumption, especially to encourage more non-consumers to participate in the market and start trying fresh produce (in this case, blueberries).

Moreover, the triple hurdle results indicate that consumer perceptions of blueberry characteristics, such as price and taste, are strongly correlated only with the consumption participation and consumption frequency decisions, with no significant effect on the market participation decision. This indicates that improvements in the taste of the product will likely stimulate potential consumers to consume and encourage existing consumers to purchase more, but will not significantly change the behavior of non-consumers. It also indicates that discounted prices or promotions will likely lead to increased consumption quantity, but not to an increased number of consumers in the market.


Overall, this study demonstrates that different factors influence non-consumers and potential consumers, which underscores the contribution of the triple hurdle count data model. The factors that characterize non-consumers are mostly stable demographic variables such as ethnicity, age, income level, and family characteristics, together with consumers’ knowledge of health information, none of which changes quickly. The factors that characterize potential consumers are more closely related to product characteristics, such as taste and price.


Figure 3-1. Zero consumption of fresh blueberry per month (created by author)

General consumer sample
  Stage 1: Non-participants | Market participants
  Stage 2 (market participants): Potential consumers | Consumers
  Stage 3 (consumers): Consumption frequency

Figure 3-2. Diagram of the data generating process of the Triple Hurdle Count Data model (created by author)

[Chart for Figure 3-1. Title: Zero Consumption of Fresh Blueberry. Horizontal axis: Month. Vertical axis: Number of Observations (0 to 250). Series: Non-consumer, Potential-consumer, Zero-consumption.]


Table 3-1. Variable Descriptions

Variable      Description                                   Value                     Model
Male          Percent of sample male                        35.7%                     P/C/F
College       Percent of sample with at least four-year     40.6%                     P/C/F
              college degree
Age           Age in years (continuous in analysis)         18-24 years       13.9%   P/C/F
                                                            25-29 years       11.1%
                                                            30-34 years       10.8%
                                                            35-39 years       5.6%
                                                            40-44 years       8.2%
                                                            45-49 years       8.7%
                                                            50-54 years       11.5%
                                                            55-59 years       9.4%
                                                            60-64 years       9.2%
                                                            65 or above       11.6%
Income        Estimated household income                    $14,999 or less   11.2%   P/C/F
                                                            $15,000-$24,999   13.5%
                                                            $25,000-$34,999   14.7%
                                                            $35,000-$49,999   17.4%
                                                            $50,000-$74,999   21.0%
                                                            $75,000-$99,999   11.6%
                                                            $100,000 or above 10.6%
Hispanic      Percent Hispanic                              4.0%                      P/C/F
Black         Percent Black/African American                10.1%                     P/C/F
Asian         Percent Asian                                 3.2%                      P/C/F
White         Percent White                                 82.3%                     P/C/F
Otherrace     Percent other races                           0.4%                      P/C/F
Health_Aware  Percent who are aware of health benefits      51.9%                     P/C/F
              of blueberry
Budget        Food budget per week                          Less than $49     11.5%   P/C/F
                                                            $50-99            36.1%
                                                            $100-149          28.9%
                                                            $150-199          13.4%
                                                            $200-$249         5.9%
                                                            $250+             4.2%
WithChild     Percent who indicate children live in the     34.5%                     P/C/F
              household
Peop_number   Number of people in the household             1-2               55.0%   P/C/F
              (continuous in the analysis)                  3-4               34.6%
                                                            5-6               8.8%
                                                            7-8               1.3%
                                                            9 or above        0.3%
Taste         Percent who indicate taste as a reason for    55.2%                     C/F
              eating/not eating blueberries
Price         Percent who indicate price as a reason for    55.0%                     C/F
              eating/not eating blueberries
Spring        Season dummy for Spring                       23.4%                     C/F
Summer        Season dummy for Summer                       23.9%                     C/F
Fall          Season dummy for Fall                         27.2%                     C/F


Table 3-2. Estimated probabilities for fresh blueberry consumption

                     Observed   Double-hurdle approach   Triple-hurdle approach
Non-consumers        20.06%     19.83%                   21.26%
Potential consumers  26.84%     23.60%                   27.03%
Consumers            53.10%     56.67%                   51.71%

Table 3-3. Fresh blueberry consumption: summary statistics from double hurdle approach and triple hurdle model

Fresh Blueberry Consumption
                     DH           TH
N                    4038         4038
K                    39           60
Log-likelihood       -5237.126    -5064.679
AIC                  10513        10189
BIC                  10798        10627
LR: TH versus DH                  344.894*** (df=21)

(***), (**) and (*) indicate statistical significance at the 1%, 5% and 10% levels, respectively. The preferred model with regard to each information criterion is indicated in bold.


Table 3-4. Fresh blueberry consumption: regression results

              Triple-Hurdle Count Data Model                           Double-Hurdle Count Data Model
Explanatory   Stage 1            Stage 2            Stage 3            Stage 1 & Stage 2  Stage 3
variables     Participation      Consumption        Consumption        Participation      Consumption
              intention          intention          intensity          intention          intensity
Female        0.064 (0.057)      0.026 (0.028)      0.430 (0.134)***   -0.063 (0.042)     0.074 (0.030)***
Caucasian     0.340 (0.161)**    -0.178 (0.083)**   -0.082 (0.307)     0.212 (0.104)**    -0.132 (0.068)**
Hispanic      -1.062 (0.304)**   0.171 (0.103)      -0.681 (0.450)     -0.241 (0.118)**   -0.112 (0.080)
Asian         -0.071 (0.194)     0.254 (0.133)      -0.700 (0.504)     0.080 (0.150)      0.115 (0.088)
Black         0.141 (0.166)      0.033 (0.089)      -0.116 (0.341)     -0.219* (0.113)    0.104 (0.073)
College       0.003 (0.061)      -0.091 (0.026)*    0.165 (0.138)      0.118 (0.043)***   -0.024 (0.030)
Health_Aware  0.812 (0.065)***   0.021 (0.031)      1.234 (0.282)***   0.734 (0.042)***   0.210 (0.031)***
Age           -0.055 (0.010)**   -0.032 (0.005)**   -0.123 (0.028)**   -0.017 (0.007)***  -0.036 (0.005)***
Income        0.020 (0.011)      0.004 (0.005)      0.035 (0.032)      0.029 (0.009)***   0.008 (0.007)
Food budget   0.126 (0.019)***   0.090 (0.011)***   0.198 (0.028)***   0.050 (0.015)***   0.104 (0.009)***
Peop_number   -0.167 (0.051)**   -0.053 (0.029)     -0.110 (0.097)     -0.028 (0.038)     -0.010 (0.025)
With_child    0.241 (0.074)***   -0.059 (0.038)     0.121 (0.151)      0.142 (0.058)***   0.001 (0.038)
Vegetarian    0.503 (0.181)***   -0.231 (0.042)**   -0.283 (0.354)     0.244 (0.125)*     -0.019 (0.067)
Spring        -0.124 (0.085)     -0.093 (0.035)**   -0.263 (0.255)     -0.103 (0.057)*    0.055 (0.045)
Summer        0.399 (0.082)***   0.422 (0.039)***   0.583 (0.187)***   0.006 (0.056)      0.459 (0.040)***
Fall          0.115 (0.074)***   0.206 (0.036)**    0.026 (0.231)      -0.076 (0.054)     0.236 (0.041)***
Price         -0.067 (0.059)     -0.443 (0.030)**   -0.190 (0.126)**   -0.178 (0.040)***  -0.293 (0.028)***
Taste         -0.062 (0.051)     0.589 (0.023)***   1.155 (0.534)***   0.131 (0.041)***   0.589 (0.036)***
Constant      0.694 (0.233)      0.694 (0.107)***   -3.029 (0.753)**   -0.032 (0.143)     -0.159 (0.105)

                Triple-Hurdle Count Data Model   Double-Hurdle Count Data Model
Rho(1,2)        -0.794 (0.055)***
Rho(1,3)        1.092 (0.044)***                 -0.112 (0.022)***
Rho(2,3)        0.140 (0.033)***
# of obs        4038                             4038
Log-likelihood  -5064.679                        -5237.126

Standard errors are in parentheses. (***), (**) and (*) indicate statistical significance at the 1%, 5% and 10% levels, respectively.



Table 3-5. Marginal Effects for Triple Hurdle Count Data Model

              Pr(Non-consumer)                      Pr(Potential-consumer)
              Triple-hurdle      Double-hurdle      Triple-hurdle      Double-hurdle
              Pr(R=0)            Pr(R=0)            Pr(D=0,R=1)        Pr(D=0,R=1)
Female        -0.014 (0.013)     0.018 (0.012)      -0.009 (0.012)     -0.025*** (0.009)
Caucasian     -0.075** (0.036)   -0.061 (0.029)***  0.083*** (0.035)   0.054 (0.021)
Hispanic      0.234*** (0.064)   0.069*** (0.034)   -0.104*** (0.044)  0.010 (0.025)
Asian         0.016 (0.044)      -0.023 (0.043)     -0.106* (0.054)    -0.024 (0.028)
Black         -0.031 (0.037)     0.063* (0.032)     -0.009 (0.037)     -0.047*** (0.023)
College       -0.001 (0.014)     -0.034*** (0.012)  0.037*** (0.011)   0.016* (0.009)
Health_Aware  -0.180*** (0.016)  -0.210*** (0.010)  0.018 (0.013)      0.005 (0.008)
Age           0.012*** (0.002)   0.005*** (0.002)   0.011*** (0.002)   0.008*** (0.001)
Income        -0.004 (0.002)     0.008*** (0.003)   -0.001 (0.002)     0.000 (0.002)
Food budget   -0.028*** (0.004)  -0.014*** (0.004)  -0.033*** (0.005)  -0.024*** (0.003)
Peop_number   0.037*** (0.011)   0.008 (0.011)      0.016 (0.012)      0.001 (0.008)
With_child    -0.053*** (0.017)  -0.041*** (0.017)  0.032* (0.016)     0.011 (0.012)
Vegetarian    -0.111*** (0.041)  -0.070* (0.036)    0.110*** (0.022)   0.026 (0.021)
Spring        0.027 (0.019)      0.030* (0.016)     0.034*** (0.015)   -0.024* (0.013)
Summer        -0.088*** (0.017)  -0.002 (0.016)     -0.159*** (0.017)  -0.125*** (0.012)
Fall          -0.047*** (0.017)  0.022 (0.015)      -0.036*** (0.015)  -0.071*** (0.012)
Price         0.015 (0.013)      0.051*** (0.011)   0.178*** (0.012)   0.065*** (0.009)
Taste         0.014 (0.012)      -0.037*** (0.012)  -0.241*** (0.012)  -0.149*** (0.009)

Standard errors are in parentheses. (***), (**) and (*) indicate statistical significance at the 1%, 5% and 10% levels, respectively.


Table 3-6. Comparison of the marginal effects for Triple Hurdle Count Data Model and Double Hurdle Count Model

                  Pr(Y=1)                               Pr(Y=2)                               Pr(Y=3)                               Pr(Y=4)
                  Triple-hurdle      Double-hurdle      Triple-hurdle      Double-hurdle      Triple-hurdle      Double-hurdle      Triple-hurdle      Double-hurdle
Female            -0.030* (0.014)    -0.010* (0.006)    -0.039*** (0.013)  0.009 (0.005)      -0.041*** (0.013)  0.005** (0.002)    0.132*** (0.013)   0.002 (0.005)
Caucasian         -0.020 (0.036)     0.033** (0.015)    -0.004 (0.031)     -0.013 (0.012)     0.005 (0.030)      -0.009 (0.005)     0.011 (0.030)      -0.004 (0.012)
Hispanic          0.045 (0.046)      -0.034** (0.018)   0.070* (0.042)     -0.030 (0.014)     0.064 (0.042)      -0.011*** (0.006)  -0.309*** (0.042)  -0.004 (0.014)
Asian             0.127* (0.058)     0.011 (0.022)      0.079* (0.047)     0.023 (0.016)      0.066 (0.048)      0.010 (0.007)      -0.182*** (0.049)  0.004 (0.016)
Black             0.026 (0.041)      -0.033** (0.017)   0.012 (0.034)      0.008 (0.013)      0.009 (0.034)      0.006 (0.006)      -0.007 (0.033)     0.003 (0.013)
College           -0.039** (0.014)   0.017*** (0.006)   -0.021 (0.014)     0.001 (0.005)      -0.016 (0.014)     -0.001 (0.002)     0.040*** (0.013)   -0.001 (0.005)
Health_Aware      -0.063** (0.023)   0.106*** (0.005)   -0.113*** (0.024)  0.068*** (0.004)   -0.117*** (0.024)  0.023*** (0.002)   0.453*** (0.023)   0.007** (0.004)
Income            -0.001 (0.003)     0.004 (0.001)      -0.003 (0.003)     0.003 (0.001)      -0.003 (0.003)     0.001 (0.001)      0.013*** (0.003)   0.000 (0.001)
Child             -0.012 (0.017)     0.021** (0.009)    -0.014 (0.015)     0.006 (0.007)      -0.012 (0.015)     0.001 (0.003)      0.061*** (0.015)   0.000 (0.006)
Vegetarian        0.004 (0.036)      0.036** (0.019)    0.014 (0.035)      0.008 (0.012)      0.025 (0.035)      0.001 (0.005)      -0.033 (0.035)     -0.000 (0.013)
Spring            -0.008 (0.024)     -0.015* (0.008)    0.018 (0.024)      0.005 (0.008)      0.024 (0.024)      0.003 (0.004)      -0.095*** (0.024)  0.002 (0.008)
Summer            0.076*** (0.025)   -0.003 (0.008)     -0.031 (0.024)     0.079** (0.007)    -0.055** (0.025)   0.036*** (0.003)   0.256*** (0.024)   0.014*** (0.006)
Fall              0.035*** (0.024)   -0.013 (0.008)     0.003 (0.023)      0.037*** (0.007)   -0.004 (0.023)     0.018*** (0.003)   0.049** (0.023)    0.007 (0.007)
Price             -0.101*** (0.013)  -0.024*** (0.006)  -0.009 (0.012)     -0.058*** (0.005)  0.016 (0.012)      -0.024*** (0.002)  -0.098*** (0.013)  -0.009*** (0.004)
Taste             0.046*** (0.037)   0.014 (0.007)      -0.068 (0.039)     0.106 (0.005)      -0.100*** (0.037)  0.047*** (0.003)   0.348*** (0.003)   0.018*** (0.005)
Age               0.000 (0.039)      -0.002 (0.001)     0.010*** (0.002)   -0.007*** (0.001)  0.012*** (0.002)   -0.003*** (0.001)  -0.045*** (0.002)  -0.001 (0.001)
Food Expenditure  0.012*** (0.005)   0.0065 (0.002)     -0.013** (0.004)   0.019*** (0.002)   -0.019*** (0.004)  0.009*** (0.001)   0.080*** (0.004)   0.003*** (0.001)
People_num        -0.013 (0.012)     -0.004 (0.005)     0.007 (0.010)      -0.003 (0.005)     0.010 (0.010)      -0.001 (0.002)     -0.057*** (0.010)  -0.000 (0.004)

Standard errors are in parentheses. (***), (**) and (*) indicate statistical significance at the 1%, 5% and 10% levels, respectively. The preferred model with regard to each information criterion is indicated in bold.


CHAPTER 4

DISCUSSION

Count data have been heavily employed in analyzing consumer behavior and market segmentation. Consumption count data are distinctive because they usually contain many observations of zero consumption, which poses a challenge for statistical modeling. This challenge becomes more prominent when there are potentially different types of zero consumption generated by different mechanisms. This dissertation aims to answer the question of how to correctly understand, explain, and model the abundant zero-consumption observations when analyzing consumer behavior and market segmentation.

Numerous statistical models have been developed to handle count data with zero-inflation and over-dispersion. The most commonly used are Zero-inflated models and Hurdle models, each of which can be specified with either a Poisson or a Negative binomial formulation, with the latter allowing for greater over-dispersion. In empirical analyses of consumption behavior, researchers have used various statistical models, chosen according to the nature of the datasets and the assumptions made, to understand the factors influencing consumer decisions on both market participation and consumption frequency (Hall, 2004; Hendrix and Haggard, 2015; Almasi et al., 2016).
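The difference between the two formulations can be summarized by their mean-variance relationships. As a reminder, using the common NB2 parameterization (the symbols \mu and \alpha are introduced here for illustration and are not taken from the notation of this dissertation):

```latex
\begin{align}
\text{Poisson:} \quad
  & \operatorname{E}[y] = \mu, \qquad \operatorname{Var}[y] = \mu, \\
\text{Negative binomial (NB2):} \quad
  & \operatorname{E}[y] = \mu, \qquad \operatorname{Var}[y] = \mu + \alpha \mu^{2},
  \quad \alpha \ge 0 .
\end{align}
```

With \alpha = 0 the negative binomial variance reduces to the Poisson case, so a positive \alpha is what captures over-dispersion.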

Although these diverse statistical methods have been extensively employed in the empirical analysis of consumption data, evidence on how these models perform under different data characteristics, in particular different proportions of zeros, is relatively limited. The most significant difference between Hurdle models and Zero-inflated models is that the latter allow both structural zeros and sampling zeros, while the former assume only one type of zero. In marketing analysis, if the data have exclusion criteria (for example, only consumers are allowed to participate in the survey), then it is more appropriate to employ the Hurdle model to handle zero-inflation. However, when the proportion of observed zero consumption is large (more than 50%), Hurdle models have relatively poor predictive capability, with large bias. In addition, if the researcher has no prior assumption about the types of zeros, Zero-inflated models are strongly recommended for handling zero-inflation. As for the distributional formulation, the negative binomial is preferred to the Poisson when the data are over-dispersed.
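To make the distinction between the two kinds of zeros concrete, the following minimal Python sketch writes out the probability mass functions of a zero-inflated Poisson and a Poisson hurdle model. The parameter values are arbitrary and the function names are hypothetical; the sketch uses the Poisson formulation only for brevity, and it is not the specification estimated in this dissertation.

```python
import math


def poisson_pmf(k, lam):
    """Standard Poisson probability of observing count k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, pi_zero, lam):
    """Zero-inflated Poisson: structural zero with probability pi_zero,
    otherwise an ordinary Poisson draw (which can itself be zero)."""
    base = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + base if k == 0 else base


def hurdle_pmf(k, p_zero, lam):
    """Poisson hurdle: a single source of zeros with probability p_zero,
    otherwise a zero-truncated Poisson for the positive counts."""
    if k == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))


# Illustrative parameter values (assumptions, not estimates).
pi_zero, p_zero, lam = 0.2, 0.4, 1.5
structural = pi_zero                         # structural zeros in the ZIP model
sampling = (1.0 - pi_zero) * math.exp(-lam)  # sampling zeros in the ZIP model
print("ZIP P(y=0):", zip_pmf(0, pi_zero, lam), "=", structural, "+", sampling)
print("Hurdle P(y=0):", hurdle_pmf(0, p_zero, lam))
```

In the zero-inflated model the probability of a zero is the sum of a structural part and a sampling part, whereas in the hurdle model all zeros come from the first-stage decision.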

Focusing on market segmentation, there are three potential consumer groups in the market: non-participants, potential consumers, and consumers. The Hurdle model assumes that consumers must pass two stages before a positive consumption frequency is observed, which means that zero consumption can arise only in the first stage. In other words, Hurdle models have limited ability to explain non-participants. Zero-inflated models also assume a two-stage decision-making process for consumption behavior, but allow zero consumption to occur in both stages. This means Zero-inflated models can accommodate the existence of both non-consumers and potential consumers. Thus, if three different types of consumers are assumed to exist in the market, Zero-inflated models are more appropriate.

Despite the advantage of Zero-inflated models when both structural zeros (non-consumers) and sampling zeros (potential consumers) exist, these models still fail to differentiate between the two types of zero consumption. They still assume that consumers make a two-stage decision: participation intention and consumption frequency (where the chosen consumption frequency can be zero). This is restrictive because it assumes that the factors influencing potential consumers and consumers are the same, which might not always be true. For example, promotions may exert a larger impact on potential consumers than on consumers. Therefore, a Triple Hurdle Count Data model is proposed in this study, which allows us to observe the three different groups of consumers. In this three-stage approach, participation intention is determined in the first stage and, conditional on the participation decision, consumers then make the subsequent consumption intention and consumption intensity decisions. The Triple Hurdle Count Data model provides more detailed information for classifying the three types of consumers in the market (non-consumers, potential consumers, and consumers) and explores the structurally different reasons behind the three groups’ market participation, consumption intention, and consumption intensity, in sequence.


LIST OF REFERENCES

Almasi, A., Rahimiforoushani, A., Eshraghian, M. R., Mohammad, K., Pasdar, Y., Tarrahi, M.

J., ... & Jouybari, T. A. (2016). Effect of Nutritional Habits on Dental Caries in

Permanent Dentition among Schoolchildren Aged 10–12 Years: A Zero-Inflated

Generalized Poisson Regression Model Approach. Iranian journal of public

health, 45(3), 353.

Arabmazar, A., & Schmidt, P. (1982). An investigation of the robustness of the Tobit estimator

to non-normality. Econometrica: Journal of the Econometric Society, 1055-1063.

Atkins, D. C., & Gallop, R. J. (2007). Rethinking how family researchers model infrequent

outcomes: a tutorial on count regression and zero-inflated models. Journal of Family

Psychology, 21(4), 726.

Akaike, H. (1973). Information theory as an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), Second International Symposium on Information Theory. Akadémiai Kiadó, Budapest, Hungary.

Bandyopadhyay, D., DeSantis, S. M., Korte, J. E., & Brady, K. T. (2011). Some considerations

for excess zeros in substance abuse research. The American journal of drug and alcohol

abuse, 37(5), 376-382.

Bethell, J., Rhodes, A. E., Bondy, S. J., Lou, W. W., & Guttmann, A. (2010). Repeat self-harm:

application of hurdle models. The British Journal of Psychiatry, 196(3), 243-244.

Bezu, S., Kassie, G. T., Shiferaw, B., & Ricker-Gilbert, J. (2014). Impact of improved maize

adoption on welfare of farm households in Malawi: a panel data analysis. World

Development, 59, 120-131.

Binkley, J. K. (2006). The effect of demographic, economic, and nutrition factors on the

frequency of food away from home. Journal of consumer Affairs, 40(2), 372-391.

Dutang, C., Goulet, V., & Pigeon, M. (2008). actuar: An R package for actuarial science. Journal

of Statistical software, 25(7), 1-37.

Calsyn, D. A., Hatch-Maillette, M., Tross, S., Doyle, S. R., Crits-Christoph, P., Song, Y. S., ... &

Berns, S. B. (2009). Motivational and skills training HIV/sexually transmitted infection

sexual risk reduction groups for men. Journal of Substance Abuse Treatment, 37(2), 138-

150.

Cameron, A. C., & Trivedi, P. K. (2013). Regression analysis of count data (Vol. 53).

Cambridge university press.

Cannuscio, C. C., Tappe, K., Hillier, A., Buttenheim, A., Karpyn, A., & Glanz, K. (2013). Urban

food environments and residents’ shopping behaviors. American journal of preventive

medicine, 45(5), 606-614.


Civettini, A. J., & Hines, E. (2005). Misspecification effects in zero-inflated negative binomial

regression models: Common cases. In annual meeting of the Southern Political Science

Association, New Orleans.

Consul, P. C., & Jain, G. C. (1973). A generalization of the Poisson

distribution. Technometrics, 15(4), 791-799.

Cragg, J. G. (1971). Some statistical models for limited dependent variables with application to

the demand for durable goods. Econometrica: Journal of the Econometric Society, 829-

844.

Crowley, F., Eakins, J., & Jordan, D. (2013). Participation, expenditure and regressivity in the

Irish lottery: Evidence from Irish household budget survey 2004/2005. The Economic and

Social Review, 43(2, Summer), 199-225.

Desouhant, E., Debouzie, D., & Menu, F. (1998). Oviposition pattern of phytophagous insects:

on the importance of host population heterogeneity. Oecologia, 114(3), 382-388.

Desjardins, C. D. (2013). Evaluating the performance of two competing models of school

suspension under simulation-the zero-inflated negative binomial and the negative

binomial hurdle. University of Minnesota.

Duan, N., Manning, W. G., Morris, C. N., & Newhouse, J. P. (1983). A comparison of

alternative models for the demand for medical care. Journal of business & economic

statistics, 1(2), 115-126.

Famoye, F., & Singh, K. P. (2006). Zero-inflated generalized Poisson regression model with an

application to domestic violence data. Journal of Data Science, 4(1), 117-130.

Greene, W. H. (1994). Accounting for excess zero and sample selection in Poisson and negative

binomial regression models.

Greenwood, M., & Yule, G. U. (1920). An inquiry into the nature of frequency distributions

representative of multiple happenings with particular reference to the occurrence of

multiple attacks of disease or of repeated accidents. Journal of the Royal statistical

society, 83(2), 255-279.

Gurmu, S. (1998). Generalized hurdle count data regression models. Economics Letters, 58(3),

263-268.

Hall, D. B. (2000). Zero-inflated Poisson and binomial regression with random effects: a case

study. Biometrics, 56(4), 1030-1039.

Hall, D. B., & Berenhaut, K. S. (2002). Score tests for heterogeneity and overdispersion in zero-inflated Poisson and binomial regression models. Canadian journal of statistics, 30(3), 415-430.


Hall, D. B., & Zhang, Z. (2004). Marginal models for zero inflated clustered data.

Statistical Modelling, 4(3), 161-180.

Han, E., & Powell, L. M. (2013). Consumption patterns of sugar-sweetened beverages in the

United States. Journal of the Academy of Nutrition and Dietetics, 113(1), 43-53.

Harris, M. N., & Zhao, X. (2007). A zero-inflated ordered probit model, with an application to

modelling tobacco consumption. Journal of Econometrics, 141(2), 1073-1099.

Hendrix, C. S., & Haggard, S. (2015). Global food prices, regime type, and urban unrest in the

developing world. Journal of Peace Research, 52(2), 143-157.

Hu, M. C., Pavlicova, M., & Nunes, E. V. (2011). Zero-inflated and hurdle models of count data

with extra zeros: examples from an HIV-risk reduction intervention trial. The American

journal of drug and alcohol abuse, 37(5), 367-375.

Huang, H., & Chin, H. C. (2010). Modeling road traffic crashes with zero-inflation and site-

specific random effects. Statistical Methods & Applications, 19(3), 445.

Jackman, S. (2017). pscl: Classes and Methods for R Developed in the Political Science

Computational Laboratory. United States Studies Centre, University of Sydney. Sydney,

New South Wales, Australia. R package version 1.5.1.

URL https://github.com/atahk/pscl/

Jaunky, V. C., & Ramchurn, B. (2014). Consumer behaviour in the scratch card market: a

double-Hurdle approach. International Gambling Studies, 14(1), 96-114.

Jiang, Y., House, L., Tejera, C., & Percival, S. S. (2015, January). Consumption of Mushrooms:

A double-Hurdle Approach. In 2015 Annual Meeting, January 31-February 3, 2015,

Atlanta, Georgia (No. 196902). Southern Agricultural Economics Association.

Johnson, N. L., Kotz, S., & Balakrishnan, N. (1997). Discrete multivariate distributions (Vol.

165). New York: Wiley.

Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in

manufacturing. Technometrics, 34(1), 1-14.

Lesser, L. I., Zimmerman, F. J., & Cohen, D. A. (2013). Outdoor advertising, obesity, and soda

consumption: a cross-sectional study. BMC Public Health, 13(1), 20.

Matheson, F. I., White, H. L., Moineddin, R., Dunn, J. R., & Glazier, R. H. (2012). Drinking in

context: the influence of gender and neighbourhood deprivation on alcohol

consumption. Journal of epidemiology and community health, 66(6), e4-e4.

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London:

Chapman & Hall.

Miller, J. M. (2007). Comparing Poisson, Hurdle, and ZIP model fit under varying degrees of

skew and zero-inflation (Doctoral dissertation, University of Florida).

Min, Y., & Agresti, A. (2005). Random effect models for repeated measures of zero-inflated

count data. Statistical modelling, 5(1), 1-19.

Morales, L. E., & Higuchi, A. (2017). Is fish worth more than meat? How consumers’ beliefs

about health and nutrition affect their willingness to pay more for fish than meat. Food

Quality and Preference.

Morland, K., Wing, S., Roux, A. D., & Poole, C. (2002). Neighborhood characteristics

associated with the location of food stores and food service places. American journal of

preventive medicine, 22(1), 23-29.

Mullahy, J. (1986). Specification and testing of some modified count data models. Journal of

econometrics, 33(3), 341-365.

Neelon, B. H., O’Malley, A. J., & Normand, S. L. T. (2010). A Bayesian model for repeated

measures zero-inflated count data with application to outpatient psychiatric service

use. Statistical Modelling, 10(4), 421-439.

Nelder, J. A. (1989). Generalized linear models.

Nelder, J. A., & Baker, R. J. (1972). Generalized linear models. John Wiley & Sons, Inc.

R Core Team. (2012). R: A language and environment for statistical computing

[Computer software manual]. Vienna, Austria. Retrieved from

http://www.R-project.org/ (ISBN 3-900051-07-0)

Ridout, M., Hinde, J., & Demétrio, C. G. (2001). A Score Test for Testing a Zero‐Inflated

Poisson Regression Model Against Zero‐Inflated Negative Binomial

Alternatives. Biometrics, 57(1), 219-223.

Rose, C. E., Martin, S. W., Wannemuehler, K. A., & Plikaytis, B. D. (2006). On the use of zero-

inflated and hurdle models for modeling vaccine adverse event count data. Journal of

biopharmaceutical statistics, 16(4), 463-481.

Shonkwiler, J. S., & Shaw, W. D. (1996). Hurdle count-data models in recreation demand

analysis. Journal of Agricultural and Resource Economics, 210-219.

Slymen, D. J., Ayala, G. X., Arredondo, E. M., & Elder, J. P. (2006). A demonstration of

modeling count data with an application to physical activity. Epidemiologic

Perspectives & Innovations, 3(1), 3.

Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica:

journal of the Econometric Society, 24-36.

Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested

hypotheses. Econometrica: Journal of the Econometric Society, 307-333.

Warton, D. I. (2005). Many zeros does not mean zero inflation: comparing the goodness‐of‐fit

of parametric models to multivariate abundance data. Environmetrics, 16(3), 275-289.

Welsh, A. H., Cunningham, R. B., Donnelly, C. F., & Lindenmayer, D. B. (1996). Modelling the

abundance of rare species: statistical models for counts with extra zeros. Ecological

Modelling, 88(1-3), 297-308.

Wedderburn, R. W. (1974). Quasi-likelihood functions, generalized linear models, and the

Gauss-Newton method. Biometrika, 61(3), 439-447.

Wenger, S. J., & Freeman, M. C. (2008). Estimating species occurrence, abundance, and

detection probability using zero-inflated distributions. Ecology, 89(10), 2953-2959.

Yen, S. T., & Huang, C. L. (1996). Household demand for Finfish: a generalized double-Hurdle

model. Journal of agricultural and resource economics, 220-234.

Zeileis, A., Kleiber, C., & Jackman, S. (2008). Regression models for count data in R. Journal of

statistical software, 27(8), 1-25.

Zorn, C. J. (1996). Evaluating zero-inflated and hurdle Poisson specifications. Midwest Political

Science Association, 18(20), 1-16.

BIOGRAPHICAL SKETCH

Yuan Jiang received her Ph.D. degree from the Department of Food and Resource

Economics at the University of Florida in the spring of 2018. Her research is focused on

consumer behaviors, agricultural marketing, and agribusiness. Prior to commencing her Ph.D.

study at the University of Florida, she received an M.S. degree in agricultural economics and an M.S.

degree in statistics from the University of Florida, and a B.E. degree in economics from Shandong

University, China.
