Using Online Preference Measurement to Infer
Offline Purchase Behavior
May 6, 2015
Daria Dzyabura
Stern School of Business, New York University, New York, NY 10012
Srikanth Jagabathula
Stern School of Business, New York University, New York, NY 10012
Eitan Muller
Stern School of Business, New York University, New York, NY 10012
Arison School of Business, The Interdisciplinary Center (IDC) Herzliya, 46101 Israel
We would like to thank John Hauser and Oded Netzer for valuable comments and suggestions on
earlier drafts of this paper.
Using Online Preference Measurement to Infer
Offline Purchase Behavior
Abstract
Most preference-elicitation methods that are used to design products and predict market shares
(such as conjoint analysis) ask respondents to evaluate product descriptions, mostly online.
However, many of these products are then sold offline. In this paper we ask how well preference-
elicitation studies conducted online perform when predicting offline consumer evaluation. To
that end, we conduct two within-subject conjoint studies, one online and one with physical
products offline. We find that the weights of the product attributes (partworths) are different in
the online and offline studies, and that these differences might be considerable.
We propose a model that captures this change in weights and derive an estimator for offline
parameters based on the individual respondent’s online parameter, and for population-level
parameters. We demonstrate that such augmentation of online conjoint data with offline data
leads to significant improvement in both individual prediction and estimation of population-level
parameters. We also ask respondents to state their uncertainty about product attributes, and we
find that while respondents anticipate some of the attributes whose weights change, they
completely miss others. Thus this bias might not be accurately detected through an online study.
Introduction
In 2013, online market research accounted for more than 85% of the $10 billion spent on
quantitative research in the US (ESOMAR 2014).¹ At the same time, overall online sales were
less than 9% of the $3.2 trillion in total US retail sales (Sehgal 2014). Thus, while most
consumer products are sold offline, marketing research is mostly done online. The implicit
assumption is that findings gathered from online marketing research can be used to predict
offline purchasing behavior. If there are systematic differences between preferences elicited
online and offline purchase behavior, then these differences may be consequential when firms
use the results from the research to plan a new product or predict market shares.
There are a number of behavioral reasons why the evaluation of a physical product may
differ from the evaluation of its online description, and thus consumers may assign different
weights to features in the two formats. In general, online and offline channels vary in the types of
information they convey effectively to consumers and in the consumer’s cost of evaluating them.
In this paper, we systematically compare consumers’ online and offline product evaluations
by conducting two within-subject conjoint studies: one online, in which participants evaluate
product descriptions and pictures, and one offline, with physical products. We chose a messenger
bag with fully configurable, discrete features as a product that is well suited for a conjoint study.
We estimated the weights of the product attributes (“partworths”) in a linear compensatory
model, then compared the partworths obtained from the online and offline formats. Our main
results are the following:
1 Qualitative research is still done almost exclusively offline: Of the $3 billion in qualitative research performed in
the US, 99% is done offline, of which the vast majority is focus groups (ESOMAR 2014).
- Of the ten partworth parameters estimated, eight changed significantly from the online to
the offline study.
- We propose a method of correcting for this online/offline discrepancy, which is based on
maximizing the conditional likelihood of the task of interest (offline), conditioned on data
collected from a different task (online). We show that supplementing online conjoint data
with offline data leads to significant improvement in both individual prediction and
estimation of population-level parameters.
- When asked about their uncertainty regarding product attributes, respondents anticipated
some of the attributes whose weights changed, while completely missing others. Therefore,
the bias cannot be corrected or even accurately detected through an online study.
Taking into account the difference between the firm’s online preference elicitation and
offline purchasing behavior of its customers is important for several reasons: The first and most
obvious one concerns marketing research such as product development or predicting market
shares of products that will be sold offline. Research firms prefer online research since an offline
conjoint study is costly (as it might require making physical prototypes) and time-consuming (as
it requires bringing respondents to offline locations). We demonstrate that supplementing a large
online conjoint study with data from a smaller group of respondents who complete both an
online and offline study will give approximately the same level of accuracy as a large (and
costly) offline study. The data from the smaller group will allow a correction of the online/offline
discrepancies, which can then be applied to the large group.
Aside from the potentially misleading predictions generated by online marketing research,
the discrepancy between online and offline consumer choice behavior has implications for mixed
as well as online retailers, such as Warby Parker and Zappos. Even when shopping online, many
consumers engage in “research shopping”, that is, evaluating the product in a brick-and-mortar
store before purchasing online (Neslin and Shankar 2009; Verhoef, Neslin, and Vroomen 2007).
For these research shoppers, this discrepancy remains since they would likely use their offline
evaluations (partworths). For mixed and online retailers whose consumers make a purchase
decision online but ultimately evaluate and decide to keep the product based on physical
evaluation upon receiving it, the discrepancy between the two evaluations can lead to increased
product returns (Dzyabura and Jagabathula 2014). Understanding the discrepancy will allow the
retailer to better control for returns of purchased products.
The paper is organized as follows: in the next section we discuss the background of online
and offline preference elicitation, and the following section describes the conjoint studies we
conducted, followed by the model and results. In the subsequent section we propose a correction,
the Inter-task Conditional Likelihood method, and show that it leads to better out-of-sample
prediction of the offline evaluation than simply using the online data. We discuss the value of
using stated uncertainty, and conclude with implications of our study.
Literature Review
Researchers have developed various methods for estimating consumer preferences based
on conjoint studies that ask respondents to rate, rank, or choose among several product “profiles”
or descriptions of the product’s attributes. In a review article, Netzer et al. (2008) present a
framework for looking into recent contributions to this important marketing research tool: (1) the
problem to address; (2) the data collection approach; (3) the estimation of a preference model
and its conversion into action. In this context, our effort is directed at the latter two components
of the framework: data collection and the estimation (or correction) of the preference model. In
these two areas, existing work proposes better data collection and estimation techniques to
improve the reliability of data collected, as well as estimates of parameters. These new
techniques include adaptive designs to help avoid respondent fatigue by reducing the number of
questions (e.g. Toubia, Hauser and Simester 2004; Dzyabura and Hauser 2011); incentive
compatibility to motivate the participants and improve the validity of responses (e.g. Ding
2007); Bayesian methods to better account for respondent heterogeneity (e.g. Allenby, Arora, and
Ginter 1995); inclusion of subjective attributes (Luo, Kannan, and Ratchford 2008); and
incorporating non-compensatory decision rules (Gilbride and Allenby 2004; Yee et al. 2007;
Hauser et al. 2010).
Several papers introduced the idea of supplementing conjoint estimation with additional
data that is external to the conjoint study to improve the quality of parameter estimates,
especially with respect to Bayesian estimation (Yang, Toubia, and de Jong 2015; Gilbride, Lenk,
and Brazzel 2008; Luo, Kannan, and Ratchford 2008; Netzer et al. 2008; Feit, Beltramo, and
Feinberg 2010; Bradlow 2005; Sandor and Wedel 2001, 2005). Marshall and Bradlow (2002)
provide a Bayesian approach to combining conjoint data with another data source. In their
approach, the latter data source is used to form a prior distribution. They demonstrate their
approach by using respondents’ self-explicated utility weights to form the prior. Along the same
lines, Dzyabura and Hauser (2011) use a product configurator in conjunction with previous
survey data to form priors for an adaptive non-compensatory preference-elicitation method. A
similar approach is taken by Gilbride, Lenk, and Brazzel (2008), who in a choice-based conjoint
framework argue that for Bayesian estimation, external market shares could be used in order to
compute the prior distribution for the parameters. The current research follows in the same vein,
with the offline study providing an additional data source to supplement the online conjoint data.
One issue that comes up when supplementing conjoint data with data from another source is how
much weight should be given to the two sources. In our case, the offline study serves as the
“external” data source, and the weight given to it is proportional to the variance/covariance
between the online and offline parameters.
We contribute to the field of quantitative preference measurement by proposing a method
to improve the validity of online conjoint studies in predicting offline purchase behavior. We are
the first to systematically investigate the effect of the medium on conjoint estimates. The vast
majority of conjoint studies are done on the computer with descriptions of hypothetical products’
attributes (e.g. Allenby, Arora and Ginter 1995; Ding 2007; Evgeniou, Pontil and Toubia 2007;
Jedidi and Zhang 2002; Lenk et al. 1996). While there have been some efforts at making online
conjoint studies more realistic (Dahan and Srinivasan 2000; Berneburg and Horst 2007), the
conjoint literature has not explicitly evaluated whether the typical task format—in which product
descriptions are shown to the consumer as attribute descriptions on the computer—is
representative of the way a respondent would behave if evaluating the physical product with the
same features.
The full description of the conjoint studies is presented next.
Experimental design
In order to systematically compare online and offline product evaluations, we conducted
two within-subject conjoint studies: one online, in which participants evaluate product
descriptions, and the other offline, with physical products. A firm that is considering launching a
new product or a new version of an existing one could use this framework at a prelaunch stage
with prototypes.
Product:
The choice of the “right” product is important since we wished a product that is configurable,
has discrete attributes, and is priced such that subjects would give their choice full attention.
Timbuk2 messenger bags were chosen for the following reasons:
(1) they vary on discrete features, some of which are “touch and feel” features for which we
might expect to see discrepancy between online and offline evaluations; (2) they are fully
configurable, which allowed us to purchase bags with the aim of creating a balanced orthogonal
design for the physical conjoint; (3) they are in the right price range, such that they are expensive
enough for participants to take the decision seriously, but cheap enough that undergraduate
students might be interested in purchasing them; (4) they are infrequently purchased, such that
we can expect that many participants would not be familiar with some of the attributes and not
have well-formed preferences; and finally (5) they are physically small enough for us to be able
to conduct the study in the behavioral lab.
Attributes:
Timbuk2’s website offers a full customization option that includes a number of features
(http://www.timbuk2.com/customizer). We selected a subset of attributes that we expected to be
relevant to the target population and for which there is likely to be some uncertainty on the part
of consumers and respondents. For example, we excluded the Right- or Left-Handed Strap
option since respondents would not have any uncertainty with respect to being left- or right-
handed. In addition, we combined the five color features into one Exterior Design feature that
has four options. To make the study manageable we reduced the number of levels of some of the
features. We therefore have the following six attributes for the study:
- Exterior design (4 options): Black, Blue, Reflective, Colorful
- Size (2 options): Small (10 x 19 x 14 in), Large (12 x 22 x 15 in)
- Price (4 levels): $120, $140, $160, $180
- Strap pad (2 options): Yes, No
- Water bottle pocket (2 options): Yes, No
- Interior compartments (3 options): Empty bucket with no dividers, Divider for files, Padded laptop compartment
Since we treat the price variable as continuous, the remaining five attributes have a total of 13
discrete attribute levels. Setting the default level of each dummy variable to zero (black color,
small size, no strap pad, no water bottle pocket, and empty bucket) leaves 10 parameters to be
estimated (8 discrete, one continuous, and a constant). Using the D-optimal
study design criterion (Kuhfeld, Tobias and Garratt 1994; Huber and Zwerina 1996), we selected
a 20-product design that has a D-efficiency of 0.97.
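The D-optimal criterion selects the design maximizing the determinant of the information matrix X′X. The sketch below illustrates one common normalization of D-efficiency, under which an orthogonal design scores 1; the exact scaling used by Kuhfeld, Tobias, and Garratt (1994) may differ, so treat this as an illustration rather than their procedure.

```python
import numpy as np

def d_efficiency(X):
    """Relative D-efficiency of a design matrix X (n runs x p parameters):
    det(X'X / n)^(1/p), which equals 1 for an orthogonal design whose
    columns are coded +1/-1 (effects coding)."""
    n, p = X.shape
    return np.linalg.det(X.T @ X / n) ** (1.0 / p)

# Toy 2^3 full factorial in +1/-1 effects coding: orthogonal by construction.
levels = np.array([[s1, s2, s3] for s1 in (1, -1)
                   for s2 in (1, -1) for s3 in (1, -1)], dtype=float)
X = np.hstack([np.ones((8, 1)), levels])  # intercept + three main effects
print(round(d_efficiency(X), 3))  # orthogonal design -> 1.0
```

A search procedure (for example, a Fedorov-style exchange algorithm) would compare candidate 20-product subsets by this criterion and keep the best one found.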
Participants:
We recruited 122 participants from a university subject pool where respondents signed up
for an individual time slot. Because one of the two studies involved looking at physical bags,
only one person participated at a time, to avoid participants influencing one another’s preferences.
Incentives:
To ensure incentive compatibility and promote honest responses, participants were told
by the experimenter that they would be entered in a raffle for a chance to win a free messenger
bag. Were they to win, their prize would be a bag that was configured to their preferences, which
the researchers would infer from the responses they provided in the study. This chance of
winning a bag provides incentive to participants to take the task seriously and respond truthfully
with respect to their preferences (Ding 2007, Toubia et al. 2012). We followed the instructions
used by Ding et al. (2011) and told participants that, were they to win, they would be given a
messenger bag plus cash, which together would be valued at $180. The cash component was
intended to eliminate any incentive for the participants to provide higher ratings for more
expensive items, in order to win a more expensive prize. Respondents were paid $7 to complete the
30-minute study, plus the chance to win the incentive-aligned prize discussed above. All 122
participants completed the study; that is, the completion rate was 100%, which is not unreasonable
for a lab study.
Conjoint task:
We used a ratings-based task in which respondents rated each bag on a 5-point scale
(Definitely not buy; Probably not buy; May or may not buy; Probably buy; Definitely buy). We
chose a ratings-based task rather than a choice-based task because the latter is logistically much
more complex with physical products, in a study that is already demanding for participants. Even
when conducting a conjoint study online, choice tasks take as much or more time than ratings
tasks (Huber, Ariely and Fischer 2002; Orme, Alpert and Christensen 1997), and produce less
information than individual product rating tasks (Moore 2004). Conducting choice tasks offline
would be even more time consuming as the experimenter would have to present the respondent
with a set of bags, ask the respondent to choose, then present another set of bags, and so on. For
a comprehensive comparison of ratings-based and choice-based conjoint analysis models, see
Moore (2004).
Online task:
The online task was conducted using Sawtooth Software. The first screens walked the
participants through the feature descriptions one by one. After that, respondents were shown a
practice rating question and were informed that it was for practice and that their response to the
question would be discarded. The following screens each presented a single product configuration,
along with the 5-point scale, and one additional question that was used for another study. An
example screen shot is shown in Figure 1a. Participants could go back to previous screens if they
wanted but could not skip a question. Lastly, participants were asked to rate each of the 13
features with respect to what degree they felt they would need to examine a product with this
feature to be able to evaluate it. This was measured on a sliding scale ranging from “Definitely do
not need to see” to “Definitely need to see”, corresponding to 0 and 100, respectively.
Figure 1a: Sample online conjoint screen shot
Figure 1b: Offline task room setup
Offline task:
The offline task was conducted in a room separate from the computer lab in which the
online task had been conducted to ensure that participants could not see the bags while
completing the online task. This task was done individually, one respondent at a time in the
room, so as to avoid a contagion effect. The bags were laid out on a conference table, each with a
card next to it displaying a corresponding number (indexing the item), and the bags were
arranged in the order 1 through 20 (see Figure 1b). The prices were displayed on stickers on
Timbuk2 price tags attached to each bag. The experimenter walked the respondents through all
the features, showing each one on a sample bag.
Model
In order to investigate whether participants’ preferences differ between the online and offline
formats, we allow the partworths to vary by respondent, feature, and format. We use the
following standard specification² for each individual’s rating of each product in each format:

(1)    r_ik^f = β_i0^f + Σ_j β_ij^f x_kj + ε_ik^f,

where r_ik^f is the rating provided by participant i to bag k in task format f (online or offline);
β_ij^f is the partworth assigned by participant i to feature j in task format f; β_i0^f is the
intercept; and ε_ik^f is a random error term.
Product k is captured by its J attribute levels x_kj, where all the attributes are coded as binary
dummy variables except for the continuous price variable. To capture consumer heterogeneity,
we fit a linear mixed effects (LME) model to the ratings data. That is, we assume that a
respondent’s individual partworths are drawn from a multivariate normal distribution:
2 For example, Green and Srinivasan 1990, Huber 1997, Huber, Ariely and Fischer 2002, Kalish and Nelson 1991.
β_i ~ N(μ, Σ),    where β_i = (β_i1^on, …, β_iJ^on, β_i1^off, …, β_iJ^off)′ stacks respondent i’s
online and offline partworths and μ stacks the corresponding population means.

To allow for heterogeneity among consumers, we have to estimate the elements of the main
diagonal of Σ, which correspond to Var(β_ij^f), capturing the population variance of the partworths
of each feature. Because a key construct in this paper is the correlation between a respondent’s
online and offline partworth for the same feature, we also estimate Cov(β_ij^on, β_ij^off) for all j.

Since the full matrix Σ is of order J² (400 in our case), and since we do not expect a correlation
among different features, we fix at zero the elements of Σ that correspond to Cov(β_ij^f, β_ij′^f′)
for j ≠ j′. Thus we assume that the covariance matrix is block diagonal (after reordering), with one
2 × 2 block per feature:

Σ_j = [ Var(β_ij^on)             Cov(β_ij^on, β_ij^off) ]
      [ Cov(β_ij^on, β_ij^off)   Var(β_ij^off)          ].
We estimate the LME in equation (1) using maximum likelihood, and use these estimates
for the remainder of the paper.³ The estimates of all features’ fixed effects (that is, the
population-average feature partworths) are reported in Table 1. The estimates of the population
3 Note that while choice-based conjoint traditionally requires more complex methods, such as MCMC, to estimate
the choice models, ratings tasks can be estimated using classical methods.
partworth variances, Var(β_ij^f), and the online–offline correlations, Corr(β_ij^on, β_ij^off), are
reported in Table 2.
Table 1: Mean population partworths

| Attribute | Level | Online partworth | Offline partworth | Difference |
| Exterior design | Reflective | -0.31** | -0.60** | -0.28* |
| | Colorful | -1.06** | -0.71** | 0.36** |
| | Blue | -0.22** | -0.11 | -0.12 |
| | Black (baseline) | | | |
| Size | Large | 0.27** | -0.31** | -0.58** |
| | Small (baseline) | | | |
| Price | $120, $140, $160, $180 | -0.22** | -0.15** | 0.06** |
| Strap pad | Yes | 0.51** | 0.25** | -0.26** |
| | No (baseline) | | | |
| Water bottle pocket | Yes | 0.45** | 0.17** | -0.28** |
| | No (baseline) | | | |
| Interior compartments | Divider for files | 0.41** | 0.52** | 0.11 |
| | Padded laptop compartment | 0.62** | 0.88** | 0.26** |
| | Empty bucket/no dividers (baseline) | | | |
| Intercept | | 3.72** | 3.39** | -0.33 |
**p
Because the partworth of one level of each attribute is normalized to zero, these values
always represent a comparison to the default level. For example, the negative values of the
exterior designs signify that, at the population level, Black is the preferred design.
To appreciate the magnitude of these differences, we calculated the willingness to pay for
the attributes using the methodology of Ofek and Srinivasan (2002). The resultant median
willingness to pay for Strap pad is $43 online and $31 offline; for Water bottle pocket, the WTP
is $40 online and exactly half ($20) offline. These represent considerable differences if the firm
is to base its pricing on these findings.
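For a linear utility with a negative price partworth, an attribute’s WTP is its partworth divided by the utility cost of a dollar. The sketch below applies this ratio to the population-mean partworths in Table 1, assuming the price partworth is scaled per $20 price step (an assumption about the coding). Ofek and Srinivasan (2002) compute WTP at the individual level and report medians, so the numbers here differ somewhat from the $43/$31 reported above.

```python
def wtp(beta_feature, beta_price, price_step=20.0):
    """Willingness to pay for an attribute level: the utility gain from the
    feature divided by the (absolute) utility cost of one price step,
    converted to dollars."""
    return beta_feature / abs(beta_price) * price_step

# Strap pad, using the population-mean partworths from Table 1:
print(round(wtp(0.51, -0.22)))  # online:  ~46 dollars
print(round(wtp(0.25, -0.15)))  # offline: ~33 dollars
```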
Large population standard deviations signify a great deal of heterogeneity among the
respondents in their preference for the attribute. For example, we can see that there is large
variation in respondents’ preference for Colorful, while there is a relative consensus on Strap
Pad. Also note that the preferences for Reflective and Colorful are more heterogeneous offline
than online. The value of the correlation is a measure of how systematic the bias is. If the
correlation is high, it suggests that if there is an online/offline discrepancy, it is systematic across
respondents. In the extreme case, every respondent’s online partworth estimate would differ from
its offline counterpart by a constant. If the correlation is low, then a respondent’s online
partworth is not a good predictor of her offline partworth.
Our first main result is that the population-level estimates of most features differ by task
format, and some of the differences are large, suggesting a systematic bias that is being introduced by using online
preference elicitation. This is a major issue if the aim is to make predictions in the offline
environment based on online market research. Both aggregate-level predictions such as market
shares and individual predictions such as segmentation or targeting that are based on online
preference elicitation would be incorrect.
Several attribute partworth changes are worth noting:
- The single attribute that did not change significantly from the online to the offline
scenario is the color Blue. This is likely because the Blue color can be accurately
evaluated based on the image provided in the online task, and Color is a very salient
attribute in both conditions.
- The decrease of the partworths of Water bottle pocket and Strap pad is substantial.
This may be attributed to those attributes being made more salient in the online
condition, when they are stated verbally; offline, on the other hand, they may be
overlooked by participants altogether.
- Only one attribute, Size, changed sign. Thus online respondents preferred the larger
bag but changed their preference to the smaller one once they physically examined
the bags.
- The fact that the intercept’s value does not change implies that there is no feature that
changes upon physical examination that is common to all the bags, such as the
material used.
Since our first main result indicates a substantial online discrepancy in the estimates of the
parameters of interest, we next propose a method to correct for this bias to improve predictions
about consumer offline purchase behavior.
Improving predictions of offline purchase behavior
When conducting marketing research with the purpose of product design or market share
prediction, firms typically conduct online conjoint studies with large representative samples of
participants, but with the intent of predicting purchase behavior that will take place offline. Our
results in the previous section provide strong evidence of a discrepancy in partworths
measurement between online and offline evaluations. Consequently, if the product is sold
primarily in brick-and-mortar stores, an online-only conjoint study is not sufficient, and an offline
conjoint task is required to obtain more accurate predictions. However, conducting large offline
conjoint studies is costly because it involves evaluation of physical products as opposed to online
descriptions. There are a few exceptions, such as Luo, Kannan, and Ratchford (2008) and She and
Macdonald (2013).
We propose to address the above challenge by (a) supplementing a large online conjoint
study with a sample of respondents who complete both the online and offline tasks and (b)
designing a correction that we term the Inter-task Conditional Likelihood correction (ICL),
which uses the supplemented data to improve predictions of the respondents’ offline purchase
behavior. We trade off the accuracy of our predictions against the cost of data collection by asking
a small number of the respondents from a large conjoint task to complete both the online and
offline conjoint tasks; the number of respondents chosen determines this trade-off.
The structure of the resulting data set is illustrated in Figure 2, where the shaded area
corresponds to the observed data and the un-shaded area corresponds to the missing data that are
of interest.
Figure 2: Data split into estimation and prediction. The training sample (50%) contributes both
online and offline data, used for estimation; for the hold-out sample (50%), only online data are
observed, and the offline ratings are the prediction task.
We use the data from the respondents who completed both the online and offline conjoint
tasks to infer the correlations between the online and offline partworths. We then use these
correlations to infer the missing offline ratings for the other respondents. In order to carry out the
inference, we design the ICL correction, based on maximizing the conditional likelihood of the
task of interest (offline), conditioned on data collected from a different task (online). It is
designed to exploit the correlations that exist between the online and offline partworths, and
therefore, its success depends on the extent of the correlation. We illustrate the ICL method on
the data from the conjoint task described above. Our results demonstrate that the ICL method
works well for (1) the prediction of individual-level offline ratings and (2) the accurate
estimation of the population preference distribution. A key result from our experiments is that
collecting offline and online data for a small number of respondents can result in substantial
improvements in the prediction accuracies.
(1) Individual-level offline ratings prediction
We illustrate our method in the context of the Timbuk2 conjoint study described above.
Because we have collected within-subject data on both the online and offline ratings, we are able
to hold out the offline ratings for a subset of respondents (validation set) and to predict them by
applying the ICL to their online data. To that end, we hold out the offline ratings for a subset of
our sample (the hold-out set), and predict them using only these respondents’
online data. We use the remainder of the sample (the training set) to train our model of online and
offline partworths, detailed below, which results in an individual-level estimate of the offline
partworth, β̂_ij^off, for respondent i and attribute j. We demonstrate that the proposed ICL
correction significantly improves out-of-sample prediction.
We consider two prediction tasks: (a) offline ratings for the 20 bags and (b) individual-
level offline partworths. Depending on the application, one or the other prediction task may be
more relevant. For instance, if the goal were to predict market shares, then the ability to
accurately predict offline ratings is more valuable, whereas if the goal were to segment the
market based on consumer preferences, then accurately predicting partworths is of greater value.
We quantify the prediction gains from our techniques in terms of the popular root mean square
error (RMSE) metric. We report the RMSE metrics for the two prediction tasks: RMSE_rating for
the offline ratings prediction task and RMSE_partworth for the partworths prediction task. The
metrics are defined as follows:
RMSE_rating = (1 / |S_val|) Σ_{i ∈ S_val} √[ (1/K) Σ_k ( r_ik^off − r̂_ik^off )² ]

RMSE_partworth = (1 / |S_val|) Σ_{i ∈ S_val} √[ (1/J) Σ_j ( β_ij^off − β̂_ij^off )² ]

where S_val is the set of respondents used for validation, K is the number of bags, J is the number
of attribute parameters, and r̂_ik^off is the predicted offline rating of respondent i for bag k
computed according to the individual-level estimate of the offline partworth, β̂_ij^off, for
respondent i and attribute j.
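The ratings metric can be sketched in code: each validation respondent’s root-mean-square rating error over bags is computed first, then averaged over the validation set.

```python
import numpy as np

def rmse_rating(r_true, r_pred):
    """RMSE_rating: root-mean-square rating error per respondent (over bags),
    averaged over the validation set. Inputs: (n_respondents, n_bags) arrays."""
    per_respondent = np.sqrt(np.mean((r_true - r_pred) ** 2, axis=1))
    return per_respondent.mean()

# Toy check: one respondent off by exactly 1 on every bag, one predicted
# perfectly; the metric averages their per-respondent errors.
r_true = np.array([[3., 4., 5.], [2., 2., 2.]])
r_pred = np.array([[4., 5., 6.], [2., 2., 2.]])
print(rmse_rating(r_true, r_pred))  # (1.0 + 0.0) / 2 = 0.5
```

RMSE_partworth is computed identically, with partworth vectors in place of rating vectors.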
Table 3 compares the performance of the ICL method against three different benchmarks,
described below. The reported metrics were averaged over 50 random partitions of the data set
into 50% training and 50% validation sets to remove any data partitioning effects. We first
describe the three different benchmark methods and then describe the ICL method in detail.
Table 3: Out-of-sample predictive performance (lower is better)

| | ICL | Online data | Partial offline data | Full offline data |
| RMSE_rating | 1.04 | 1.56 | 1.18 | 0.51 |
| RMSE_partworth | 0.07 | 0.21 | 0.24 | 0.00 |
Full offline data: The first benchmark method provides a lower bound (the best achievable
performance) on the RMSE metrics. For each individual i in the validation set, we set β̂_ij^off
equal to the ordinary least squares (OLS) estimate obtained from that respondent’s offline ratings
data, and r̂_ik^off = β̂_i0^off + Σ_j β̂_ij^off x_kj. By construction, RMSE_partworth = 0, and the
resulting RMSE_rating is the lowest we can get with a linear model. This method uses data that
are not part of our training sample, but it provides the best performance we can hope to achieve.
Table 3 reports RMSE_rating = 0.51 for this method.
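The per-respondent OLS fit underlying this benchmark can be sketched as follows, with a small synthetic design matrix standing in for the study’s actual 20-bag design:

```python
import numpy as np

# Columns: intercept, two attribute dummies, price. Ratings are generated
# noiselessly from known partworths so the OLS recovery is exact.
X = np.array([[1., 0., 0., 120.],
              [1., 1., 0., 140.],
              [1., 0., 1., 160.],
              [1., 1., 1., 180.],
              [1., 1., 0., 120.],
              [1., 0., 1., 140.]])
beta_true = np.array([3.5, 0.4, 0.6, -0.01])  # hypothetical offline partworths
r_off = X @ beta_true                          # this respondent's offline ratings

# OLS estimate of the respondent's partworths from her offline ratings.
beta_ols, *_ = np.linalg.lstsq(X, r_off, rcond=None)
print(np.allclose(beta_ols, beta_true))  # True: exact recovery without noise
```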
Online-only data: This method captures the current practice: regardless of the channel in which
the product is to be sold (offline or online), the research is conducted online. For each
respondent, we predict her offline bag ratings using the partworths estimated from her online
conjoint task; i.e., we use β̂_ij^off = β̂_ij^on. This results in predicted offline ratings r̂_ik^off
computed according to the online estimated partworths:

r̂_ik^off = β̂_i0^on + Σ_j β̂_ij^on x_kj
Because this method corresponds to the way conjoint studies are typically conducted, it is a
reasonable baseline model. We note from Table 3 that the RMSE metrics of this method are
RMSE_rating = 1.56 and RMSE_partworth = 0.21, significantly higher than the lower bound
from the full offline data. Further, it is clear from the results in Table 3 that we can get
significant improvements (33% and 67% for ratings and partworths predictions, respectively)
over the current practice by incorporating offline data through the ICL method. These
improvements quantify the benefits of our method over the current practice.
Partial offline data: For this method, we used only the offline ratings in the training data.
Specifically, we trained a linear mixed effects model as described in the section above but only
on the offline ratings collected from the participants in the training data. We then used the
estimated population-level parameters as our estimates for the individual-level offline
partworths; i.e., β̂_ij^off = μ̂_j^off, the population-mean offline partworth. These partworths
were then used to predict the offline ratings for the participants in the test data set as follows:

r̂_ik^off = μ̂_0^off + Σ_j μ̂_j^off x_kj
Note that because we are not using the online observations of the participants, our partworth and ratings predictions for each product are the same across all the participants. We note from Table 3 that the RMSE metrics are 1.18 for ratings and 0.24 for partworths.
We observe that the RMSE for the ratings prediction is about 24% lower than that of the current practice (the online only data benchmark). This finding is surprising because it suggests that predicting offline ratings using population-level parameters and no individual-level information can, on average, be more accurate than actually asking the participants for their online ratings! This result reveals the level of discrepancy that exists
between the online and offline partworths in the messenger bag setting. Further, it suggests that the firm may in fact obtain a more accurate understanding of customers' offline purchase behavior from a small offline conjoint than from only a large online conjoint.
We also note that despite outperforming the current practice, the partial offline method still yields higher RMSE metrics than the ICL correction method (by about 12% for ratings and 71% for partworths predictions), suggesting that a combination of online and offline conjoint data can outperform either data source alone.
ICL correction: The ICL method computes the expected offline ratings conditioned on all the observed data. We exploit the properties of multivariate normal distributions in order to compute the conditional expectations in closed form. Specifically, recall that we assume that each respondent $i$ samples the online and offline partworths of each attribute $j$, $\beta_{ij}^{\text{on}}$ and $\beta_{ij}^{\text{off}}$, jointly from a bivariate normal distribution:
$$\begin{pmatrix} \beta_{ij}^{\text{on}} \\ \beta_{ij}^{\text{off}} \end{pmatrix} \sim N\left( \begin{bmatrix} \mu_j^{\text{on}} \\ \mu_j^{\text{off}} \end{bmatrix},\; \begin{bmatrix} (\sigma_j^{\text{on}})^2 & \operatorname{cov}(\beta_j^{\text{on}}, \beta_j^{\text{off}}) \\ \operatorname{cov}(\beta_j^{\text{on}}, \beta_j^{\text{off}}) & (\sigma_j^{\text{off}})^2 \end{bmatrix} \right),$$
where $\operatorname{cov}(\beta_j^{\text{on}}, \beta_j^{\text{off}})$ is the covariance between the online and offline partworths of attribute $j$.
We use the observed data to determine the maximum likelihood estimates of the population-level parameters $\mu_j^{\text{on}}$, $\mu_j^{\text{off}}$, $\sigma_j^{\text{on}}$, $\sigma_j^{\text{off}}$, and $\operatorname{cov}(\beta_j^{\text{on}}, \beta_j^{\text{off}})$ for each attribute $j$. Note that the data from the group of respondents who completed both the online and offline tasks allow us to infer the covariance parameters. Given the population-level parameters, we can show that the conditional distribution of $\beta_{ij}^{\text{off}}$ given $\beta_{ij}^{\text{on}}$ is a normal distribution, with mean $\mu_{j,\text{off}|\text{on}}$ and standard deviation $\sigma_{j,\text{off}|\text{on}}$ that are given by
$$\mu_{j,\text{off}|\text{on}} = \mu_j^{\text{off}} + \rho_j \frac{\sigma_j^{\text{off}}}{\sigma_j^{\text{on}}}\left(\beta_{ij}^{\text{on}} - \mu_j^{\text{on}}\right),$$
$$\sigma_{j,\text{off}|\text{on}} = \sigma_j^{\text{off}} \sqrt{1 - \rho_j^2},$$
where $\rho_j$ is the correlation between the online and offline partworths of attribute $j$.
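For the respondents observed in both channels, the maximum likelihood estimates of these population-level parameters reduce to sample means, standard deviations, and the sample correlation. A sketch with toy paired partworths (our own illustration, not the study's estimates):

```python
import numpy as np

def estimate_population_params(beta_on, beta_off):
    """MLE of the bivariate-normal population parameters for one attribute,
    from paired online/offline partworths across respondents."""
    mu_on, mu_off = beta_on.mean(), beta_off.mean()
    sd_on, sd_off = beta_on.std(), beta_off.std()  # ddof=0 matches the MLE
    rho = np.corrcoef(beta_on, beta_off)[0, 1]
    return mu_on, mu_off, sd_on, sd_off, rho

# Toy paired partworths of one attribute for five respondents:
beta_on = np.array([0.3, 0.5, 0.7, 0.4, 0.6])
beta_off = np.array([0.1, 0.2, 0.4, 0.1, 0.3])
mu_on, mu_off, sd_on, sd_off, rho = estimate_population_params(beta_on, beta_off)
print(mu_on, mu_off, round(rho, 3))
```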
Note that $\mu_{j,\text{off}|\text{on}}$ is also the maximum likelihood estimator of $\beta_{ij}^{\text{off}}$, because of normality. Therefore, under this model, the maximum likelihood estimates of a respondent's offline partworths are given by:
$$\hat{\beta}_{ij}^{\text{off}} = \mu_j^{\text{off}} + \rho_j \frac{\sigma_j^{\text{off}}}{\sigma_j^{\text{on}}}\left(\beta_{ij}^{\text{on}} - \mu_j^{\text{on}}\right).$$
Because we have a closed-form expression for the correction, computing it is straightforward once the population-level parameters have been estimated. Using the ICL correction on our data, we obtain a ratings RMSE of 1.04 and a partworths RMSE of 0.07. These are substantial improvements over the uncorrected methods.
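A minimal sketch of the closed-form correction (our own illustration; the parameter values below are made up, not estimates from our data):

```python
import math

def icl_correct(beta_on, mu_on, mu_off, sd_on, sd_off, rho):
    """Conditional mean of the offline partworth given the online one under the
    bivariate normal model; by normality, this is also the respondent-level MLE."""
    return mu_off + rho * (sd_off / sd_on) * (beta_on - mu_on)

def conditional_sd(sd_off, rho):
    """Standard deviation of the offline partworth conditional on the online one."""
    return sd_off * math.sqrt(1.0 - rho ** 2)

# Made-up numbers: a respondent whose online partworth sits above the online mean.
est = icl_correct(beta_on=0.6, mu_on=0.45, mu_off=0.17, sd_on=0.26, sd_off=0.22, rho=0.5)
print(round(est, 3))  # the online deviation is shrunk toward the offline mean
```

The stronger the online/offline correlation $\rho_j$, the more of a respondent's online deviation from the mean carries over into the offline estimate; at $\rho_j = 0$ the correction collapses to the population mean, which is exactly the partial offline benchmark.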
(2) Estimation of population-level parameters
We now investigate the ability of our proposed method to obtain accurate estimates of
population-level parameters. As above, we compare the performance of the ICL method to the
three different benchmarks. Population-level partworth parameters allow us to gain an intuitive
understanding of the population’s preference for a certain attribute (e.g., Allenby et al. 2014).
We assume that individual partworth vectors are drawn from a multivariate normal distribution,
as described above. Under this assumption, we determine the maximum likelihood estimates of
the population mean and variance of the partworth distribution for each attribute for each of the
three different benchmark methods and the ICL correction method. We quantify the difference in
their performances using the Kullback-Leibler (KL) divergence metric (Kullback and Leibler
1951). Specifically, using the full offline data, we compute the ground-truth estimates of the
mean and variance parameters. We then compute the KL divergence between the distributions
estimated using each of the methods and the ground-truth distributions obtained from the full
offline data. The KL-divergence metrics are reported in Table 4. As is clear from the table, the estimates resulting from the ICL are close to the ground-truth full offline data estimates, consistently across all the attributes, lending further support to our proposal of supplementing online data with offline data. In particular, the estimates obtained by ICL significantly outperform those obtained from online only data on all attributes except for the color Blue.
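Since each attribute's partworth distribution is univariate normal here, the KL divergence has a closed form; a short sketch (our own illustration, with arbitrary parameter values):

```python
import math

def kl_normal(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for univariate normals P = N(mu_p, var_p), Q = N(mu_q, var_q)."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

# Identical distributions have zero divergence; shifting the mean by one
# standard deviation (with unit variances) yields a divergence of 0.5.
print(kl_normal(0.25, 0.15, 0.25, 0.15))  # 0.0
print(kl_normal(0.0, 1.0, 1.0, 1.0))      # 0.5
```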
Since the offline conjoint task is challenging and costly to conduct, we explore whether it
can be avoided altogether. Specifically, consumers might have intuitions about which attribute
preferences have a greater possibility of changing from the online to the offline environment. We
tackle this issue next.
Table 4: Population-level parameters – offline, corrected, and online estimates

                                             Offline          ICL corrected                   Online only
Attribute              Level                 Mean    Var      Mean    Var    KL(off,ICL)     Mean    Var    KL(off,online)
Exterior design        Reflective            -0.60   0.70     -0.61   0.72   0.010           -0.31   0.29   0.146
                       Colorful              -0.71   0.91     -0.70   1.03   0.007           -1.06   0.88   0.073
                       Blue                  -0.11   0.25     -0.11   0.16   0.043           -0.22   0.16   0.042
                       Black
Size                   Large                 -0.31   0.28     -0.30   0.31   0.022            0.27   0.18   1.021
                       Small
Price                  $120-$180             -0.15   0.00     -0.16   0.00   0.081           -0.22   0.00   0.421
Strap pad              Yes                    0.25   0.15      0.27   0.14   0.011            0.51   0.11   0.212
                       No
Water bottle pocket    Yes                    0.17   0.05      0.15   0.05   0.068            0.45   0.07   0.619
                       No
Interior compartments  Divider for files      0.52   0.04      0.48   0.05   0.065            0.41   0.06   0.149
                       Crater laptop sleeve   0.88   0.20      0.83   0.31   0.018            0.62   0.25   0.143
                       Empty bucket
Intercept                                     3.55             3.44                           3.72
Stated Uncertainty
It is possible that consumers are aware that they cannot judge the value of certain attributes with accuracy. If the online/offline discrepancy occurs to people many times across different categories, consumers may have learned to anticipate changing personal preferences. In that case, we can improve our decision making by asking consumers to self-state the need to examine each attribute physically in order to accurately judge it. Note that this consumer belief uncertainty is different from the magnitude of the variance around the parameter estimate, which represents the researcher's uncertainty.
In the online task, after rating all the bags, participants were asked to state their certainty
about how well they could judge each feature from the online description. The exact wording of
the question was: “Some of the bag features may be clear to you simply from the description
provided online. Other features you may want to physically examine before making your final
decision. Please rate the following features on how useful it would be for you to examine a
product with this feature in a store.” Each feature was then listed, with a sliding scale ranging
from “Definitely don’t need to see in store” to “Definitely need to see in store”, which
corresponded to uncertainty ratings of 0 and 100, respectively. Price was excluded. While these
scales are not an objective measure of variance and the quantities reported should not be
interpreted in isolation, their relative values are meaningful. Comparing stated uncertainty with
the (absolute) difference between online and offline partworths, we find that stated uncertainty is
a rather poor predictor of the changes that occur. Population averages for the stated uncertainties
are given in Table 5, along with the corresponding features’ online-offline partworth differences.
Note that uncertainty was measured for all attribute levels, while partworths were normalized to
zero for one of the attribute levels.
It is clear that while participants can anticipate some of the attributes that will change, such
as Size and the Laptop sleeve, they miss others, such as Water bottle pocket, Strap pad, and
Colorful. Indeed, the correlation of the stated uncertainty and the absolute value of the difference
(for the features for which we have the difference) is rather low (36.8%). Moreover, this
correlation is driven almost exclusively by size: if we compute the correlation across all attributes excluding size, it disappears.
Table 5: Stated uncertainty and online/offline discrepancies

Attribute              Level                     Pre-evaluation uncertainty   Difference in partworths
Exterior design        Reflective                49.6                         -0.30
                       Colorful                  49.5                          0.36
                       Blue                      42.4                          0.08
                       Black                     43.5
Size                   Large                     75.7                         -0.54
                       Small                     75.4
Strap pad              Yes                       58.5                         -0.29
                       No                        30.3
Water bottle pocket    Yes                       45.6                         -0.29
                       No                        24.4
Interior compartments  Divider for files         63.0                          0.14
                       Crater laptop sleeve      70.2                          0.25
                       Empty bucket/no dividers  30.9
Conclusions and Implications
In this work we challenged the implicit assumption commonly made in market research
that findings collected from online research can be used to accurately predict offline behavior.
Consumers’ product evaluations from an online conjoint study with verbal product descriptions
as well as pictures were compared to an offline study with physical products. We found that the
vast majority of partworth parameters changed significantly from online to offline studies. To
correct for this disparity, we offered a method based on maximizing the likelihood of the offline
task, conditional on data collected from the online task. We showed that this estimator leads to
better out-of-sample prediction than using uncorrected online data.
In this paper we used primary data in order to carefully control for all factors and zero in on
the online/offline distinction. But the higher-level problem of predicting a consumer’s offline
preferences, given the same consumer’s online preferences, and other consumers’ online and
offline preferences has implications beyond online preference elicitation. Consider an online
retailer, such as Warby Parker, Zappos, or Bonobos. When consumers purchase from these
retailers, they decide what to order based on their online evaluation of the available items.
However, once they receive their order, they determine what they want to keep based on physical
evaluation. These and other online retailers typically have a very generous returns policy, so that
customers may try on several items before purchasing one. Warby Parker even offers a free
“Home Try-On” program in which customers may order several eyeglass frames to try at home,
return all, and then order the prescription lens to go with the chosen frames. Thus, the firm has
some data on both online and offline preferences for customers who have a history with Warby
Parker. When a potential new customer (who has not yet evaluated the firm’s products
physically) orders some items, the firm knows only the online behavior. In a sense the retailer
has more information than the single consumer. In addition to this website visitor’s online-
preference data, the retailer can use for estimation all the information gathered from current
customers, including online and offline preferences, to obtain an estimate on the offline
preference of the new customer. Note that the data scheme given in Figure 3 is similar in nature
to the one used in the data split in Figure 2, since the data used for estimation and prediction are
broken along the exact same lines.
Figure 3: Schematic data available to a typical online retailer
[Figure: a two-by-two layout whose columns distinguish existing customers from new customers, and whose rows distinguish the online data used for estimation from the offline prediction task.]
In the area of recommendation systems, a common problem structure is that substantial amounts of data are available on some customers while only very sparse data are available on others, and the former are used to improve predictions about the latter. In our case, the missing piece is the offline product evaluation. As we have demonstrated, relying only on a customer's online preferences to make predictions about his or her offline preferences may be unreliable. However, having access to both types of preferences for the existing set of customers enables the retailer to make a better prediction. An online retailer, through programs such as Warby Parker's Home Try-On program, may implement an offline recommendation system by including suggested items for a user who is ordering online.
References
Allenby, Greg M., Neeraj Arora, and James L. Ginter. 1995. Incorporating prior knowledge into
the analysis of conjoint studies. Journal of Marketing Research 35 (3) 152-162.
Allenby, Greg M., Jeff D. Brazell, John R. Howell, and Peter E. Rossi. 2014. Economic
valuation of product features. Quantitative Marketing and Economics 12 (4) 421-456.
Berneburg, Alma and Bruno Horst. 2007. 3D vs. traditional conjoint analysis: Which stimulus
performs best? Marketing Theory and Application, AMA Winter Educators’ Conference,
Vol. 18, Chicago, IL 112-120.
Bradlow, Eric T. 2005. Current issues and a ‘wish list’ for conjoint analysis. Applied Stochastic
Models in Business and Industry 21 (4‐5) 319-323.
Dahan, Ely and V. Srinivasan. 2000. The predictive power of Internet-based product concept
testing using visual depiction and animation. Journal of Product Innovation Management
17 (2) 99-109.
Ding, Min. 2007. An incentive-aligned mechanism for conjoint analysis. Journal of Marketing
Research 44 (2) 214-223.
Ding, Min, John R. Hauser, Songting Dong, Daria Dzyabura, Zhilin Yang, Chenting Su, and
Steven P. Gaskin. 2011. Unstructured direct elicitation of decision rules. Journal of
Marketing Research 48 (1) 116-127.
Dzyabura, Daria, and Srikanth Jagabathula. 2015. Offline assortment optimization in the
presence of an online channel. Working paper, New York University.
Dzyabura, Daria, and John R. Hauser. 2011. Active machine learning for consideration
heuristics. Marketing Science 30 (5) 801-819.
ESOMAR. 2014. Global Market Research Industry Report. ESOMAR, Amsterdam, The
Netherlands.
Evgeniou, Theodoros, Massimiliano Pontil, and Olivier Toubia. 2007. A convex optimization
approach to modeling consumer heterogeneity in conjoint estimation. Marketing Science 26
(6) 805-818.
Feit, Eleanor M., Mark A. Beltramo, and Fred M. Feinberg. 2010. Reality check: Combining choice experiments with market data to estimate the importance of product attributes. Management Science 56 (5) 785-800.
Gilbride, Timothy J., and Greg M. Allenby. 2004. A choice model with conjunctive, disjunctive,
and compensatory screening rules. Marketing Science 23 (3) 391-406.
Gilbride, Timothy J., Peter J. Lenk, and Jeff D. Brazell. 2008. Market share constraints and the
loss function in choice-based conjoint analysis. Marketing Science 27 (6) 995-1011.
Green, Paul E., and Venkat Srinivasan. 1990. Conjoint analysis in marketing: new developments
with implications for research and practice. The Journal of Marketing 54 (4) 3-19.
Hauser, John R., Olivier Toubia, Theodoros Evgeniou, Rene Befurt, and Daria Dzyabura. 2010.
Disjunctions of conjunctions, cognitive simplicity, and consideration sets. Journal of
Marketing Research 47 (3) 485-496.
Huber, Joel. 1997. What we have learned from 20 years of conjoint research: When to use self-
explicated, graded pairs, full profiles or choice experiments. Sawtooth Software Research
Paper Series.
Huber, Joel, Dan Ariely, and Gregory Fischer. 2002. Expressing preferences in a principal-agent
task: A comparison of choice, rating, and matching. Organizational Behavior and Human
Decision Processes 87 (1) 66-90.
Huber, Joel, and Klaus Zwerina. 1996. The importance of utility balance in efficient choice
designs. Journal of Marketing Research 33 (3) 307-317.
Jedidi, Kamel, and Z. John Zhang. 2002. Augmenting conjoint analysis to estimate consumer
reservation price. Management Science 48 (10) 1350-1368.
Kalish, Shlomo and Paul Nelson. 1991. A comparison of ranking, rating and reservation price
measurement in conjoint analysis. Marketing Letters 2 (4) 327-335.
Kuhfeld, Warren F., Randall D. Tobias, and Mark Garratt. 1994. Efficient Experimental Design
with Marketing Research Applications. Journal of Marketing Research 31 (4) 545-557.
Kullback, Solomon, and Richard A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22 (1) 79-86.
Lenk, Peter J., Wayne S. DeSarbo, Paul E. Green, and Martin R. Young. 1996. Hierarchical
Bayes conjoint analysis: recovery of partworths heterogeneity from reduced experimental
designs. Marketing Science 15 (2) 173-191.
Luo, Lan, P. K. Kannan, and Brian T. Ratchford. 2008. Incorporating subjective characteristics
in product design and evaluations. Journal of Marketing Research 45 (2) 182-194.
Marshall, Pablo, and Eric T. Bradlow. 2002. A unified approach to conjoint analysis
models. Journal of the American Statistical Association 97 (459) 674-682.
Moore, William L. 2004. A cross-validity comparison of rating-based and choice-based conjoint
analysis models. International Journal of Research in Marketing 21 (3) 299-312.
Neslin, Scott A., and Venkatesh Shankar. 2009. Key issues in multichannel customer
management: current knowledge and future directions. Journal of Interactive Marketing 23
(1) 70-81.
Netzer, Oded, Olivier Toubia, Eric T. Bradlow, Ely Dahan, Theodoros Evgeniou, Fred M.
Feinberg, and Eleanor M. Feit. 2008. Beyond conjoint analysis: Advances in preference
measurement. Marketing Letters 19 (3-4) 337-354.
Ofek, Elie and V. Srinivasan. 2002. How much does the market value an improvement in a
product attribute? Marketing Science 21 (4) 398-411.
Orme, Bryan K., Mark I. Alpert, and Ethan Christensen. 1997. Assessing the validity of conjoint analysis – continued. Sawtooth Software Conference Proceedings.
Sándor, Zsolt, and Michel Wedel. 2001. Designing conjoint choice experiments using managers’
prior beliefs. Journal of Marketing Research 38 (4) 430-444.
Sándor, Zsolt, and Michel Wedel. 2005. Heterogeneous conjoint choice designs. Journal of
Marketing Research 42 (2) 210-218.
Sehgal, Vikram. 2014. Forrester Research Online Retail Forecast, 2013 to 2018 (US). Forrester
Research. https://www.forrester.com/Forrester+Research+Online+Retail+Forecast+2013+To+2018+US/fulltext/-/E-RES115941
She, Jinjuan and Erin F. MacDonald. 2013. Trigger features on prototypes increase preference
for sustainability. ASME 2013 International Design Engineering Technical Conferences
and Computers and Information in Engineering Conference. American Society of
Mechanical Engineers, V005T06A043-V005T06A054.
Toubia, Olivier, Martijn G. de Jong, Daniel Stieger, and Johan Füller. 2012. Measuring consumer preferences using conjoint poker. Marketing Science 31 (1) 138-156.
Toubia, Olivier, John R. Hauser, and Duncan I. Simester. 2004. Polyhedral methods for adaptive
choice-based conjoint analysis. Journal of Marketing Research 41 (1) 116-131.
Verhoef, Peter C., Scott A. Neslin, and Björn Vroomen. 2007. Multichannel customer
management: Understanding the research-shopper phenomenon. International Journal of
Research in Marketing 24 (2) 129-148.
Yang, Liu, Olivier Toubia and Martijn G. de Jong. 2015. A bounded rationality model of
information search and choice in preference measurement. Journal of Marketing Research
52 (2) 166-183.
Yee, Michael, Ely Dahan, John R. Hauser, and James Orlin. 2007. Greedoid-based
noncompensatory inference. Marketing Science 26 (4) 532-549.