Using Online Preference Measurement to Infer
Offline Purchase Behavior
May 6, 2015
Daria Dzyabura
Stern School of Business, New York University, New York, NY 10012
Srikanth Jagabathula
Stern School of Business, New York University, New York, NY 10012
Eitan Muller
Stern School of Business, New York University, New York, NY 10012
Arison School of Business, The Interdisciplinary Center (IDC) Herzliya, 46101 Israel
We would like to thank John Hauser and Oded Netzer for valuable comments and suggestions on
earlier drafts of this paper.
Using Online Preference Measurement to Infer
Offline Purchase Behavior
Abstract
Most preference-elicitation methods that are used to design products and predict market shares
(such as conjoint analysis) ask respondents to evaluate product descriptions, mostly online.
However, many of these products are then sold offline. In this paper we ask how well preference-
elicitation studies conducted online perform when predicting offline consumer evaluation. To
that end, we conduct two within-subject conjoint studies, one online and one with physical
products offline. We find that the weights of the product attributes (partworths) are different in
the online and offline studies, and that these differences might be considerable.
We propose a model that captures this change in weights and derive an estimator for offline
parameters based on the individual respondent’s online parameter, and for population-level
parameters. We demonstrate that such augmentation of online conjoint data with offline data
leads to significant improvement in both individual prediction and estimation of population-level
parameters. We also ask respondents to state their uncertainty about product attributes, and we
find that while respondents anticipate some of the attributes whose weights change, they
completely miss others. Thus this bias might not be accurately detected through an online study.
Introduction
In 2013, online market research accounted for more than 85% of the $10 billion spent on
quantitative research in the US (ESOMAR 2014).¹ At the same time, overall online sales were
less than 9% of the $3.2 trillion in total US retail sales (Sehgal 2014). Thus, while most
consumer products are sold offline, marketing research is mostly done online. The implicit
assumption is that findings gathered from online marketing research can be used to predict
offline purchasing behavior. If there are systematic differences between preferences elicited
online and offline purchase behavior, then these differences may be consequential when firms
use the results from the research to plan a new product or predict market shares.
There are a number of behavioral reasons why the evaluation of a physical product may
differ from the evaluation of its online description, and thus consumers may assign different
weights to features in the two formats. In general, online and offline channels vary in the types of
information they convey effectively to consumers and in the consumer’s cost of evaluating them.
In this paper, we systematically compare consumers’ online and offline product evaluations
by conducting two within-subject conjoint studies: one online, in which participants evaluate
product descriptions and pictures, and one offline, with physical products. We chose a messenger
bag with fully configurable, discrete features as a product that is well suited for a conjoint study.
We estimated the weights of the product attributes (“partworths”) in a linear compensatory
model, then compared the partworths obtained from the online and offline formats. Our main
results are the following:
1 Qualitative research is still done almost exclusively offline: Of the $3 billion in qualitative research performed in
the US, 99% is done offline, of which the vast majority is focus groups (ESOMAR 2014).
- Of the ten partworth parameters estimated, eight changed significantly from the online to
the offline study.
- We propose a method of correcting for this online/offline discrepancy, which is based on
maximizing the conditional likelihood of the task of interest (offline), conditioned on data
collected from a different task (online). We show that supplementing online conjoint data
with offline data leads to significant improvement in both individual prediction and
estimation of population-level parameters.
- When asked about their uncertainty regarding product attributes, respondents anticipated
some of the attributes whose weights changed, while completely missing others. Therefore,
the bias cannot be corrected or even accurately detected through an online study.
Taking into account the difference between the firm’s online preference elicitation and
offline purchasing behavior of its customers is important for several reasons: The first and most
obvious one concerns marketing research such as product development or predicting market
shares of products that will be sold offline. Research firms prefer online research since an offline
conjoint study is costly (as it might require making physical prototypes) and time-consuming (as
it requires bringing respondents to offline locations). We demonstrate that supplementing a large
online conjoint study with data from a smaller group of respondents who complete both an
online and offline study will give approximately the same level of accuracy as a large (and
costly) offline study. The data from the smaller group will allow a correction of the online/offline
discrepancies, which can then be applied to the large group.
Aside from the potentially misleading predictions generated by online marketing research,
the discrepancy between online and offline consumer choice behavior has implications for mixed
as well as online retailers, such as Warby Parker and Zappos. Even when shopping online, many
consumers engage in “research shopping”, that is, evaluating the product in a brick-and-mortar
store before purchasing online (Neslin and Shankar 2009; Verhoef, Neslin, and Vroomen 2007).
For these research shoppers, this discrepancy remains since they would likely use their offline
evaluations (partworths). For mixed and online retailers whose consumers make a purchase
decision online but ultimately evaluate and decide to keep the product based on physical
evaluation upon receiving it, the discrepancy between the two evaluations can lead to increased
product returns (Dzyabura and Jagabathula 2014). Understanding the discrepancy will allow the
retailer to better control for returns of purchased products.
The paper is organized as follows: in the next section we discuss the background of online
and offline preference elicitation, and the following section describes the conjoint studies we
conducted, followed by the model and results. In the subsequent section we propose a correction,
the Inter-task Conditional Likelihood method, and show that it leads to better out-of-sample
prediction of the offline evaluation than simply using the online data. We discuss the value of
using stated uncertainty, and conclude with implications of our study.
Literature Review
Researchers have developed various methods for estimating consumer preferences based
on conjoint studies that ask respondents to rate, rank, or choose among several product “profiles”
or descriptions of the product’s attributes. In a review article, Netzer et al. (2008) present a
framework for looking into recent contributions to this important marketing research tool: (1) the
problem to address; (2) the data collection approach; (3) the estimation of a preference model
and its conversion into action. In this context, our effort is directed at the latter two components
of the framework: data collection and the estimation (or correction) of the preference model. In
these two areas, existing work proposes better data collection and estimation techniques to
improve the reliability of data collected, as well as estimates of parameters. These new
techniques include adaptive designs to help avoid respondent fatigue by reducing the number of
questions (e.g. Toubia, Hauser and Simester 2004; Dzyabura and Hauser 2011); incentive
compatibility to motivate the participants and improve the validity of responses (e.g. Ding
2007); Bayesian methods to better account for respondent heterogeneity (e.g. Allenby, Arora, and
Ginter 1995); inclusion of subjective attributes (Luo, Kannan, and Ratchford 2008); and
incorporating non-compensatory decision rules (Gilbride and Allenby 2004; Yee et al. 2007;
Hauser et al. 2010).
Several papers introduced the idea of supplementing conjoint estimation with additional
data that is external to the conjoint study to improve the quality of parameter estimates,
especially with respect to Bayesian estimation (Yang, Toubia, and de Jong 2015; Gilbride, Lenk,
and Brazzel 2008; Luo, Kannan, and Ratchford 2008; Netzer et al. 2008; Feit, Beltramo, and
Feinberg 2010; Bradlow 2005; Sandor and Wedel 2001, 2005). Marshall and Bradlow (2002)
provide a Bayesian approach to combining conjoint data with another data source. In their
approach, the latter data source is used to form a prior distribution. They demonstrate their
approach by using respondents’ self-explicated utility weights to form the prior. Along the same
lines, Dzyabura and Hauser (2011) use a product configurator in conjunction with previous
survey data to form priors for an adaptive non-compensatory preference-elicitation method. A
similar approach is taken by Gilbride, Lenk, and Brazzel (2008), who in a choice-based conjoint
framework argue that for Bayesian estimation, external market shares could be used in order to
compute the prior distribution for the parameters. The current research follows in the same vein,
with the offline study providing an additional data source to supplement the online conjoint data.
One issue that comes up when supplementing conjoint data with data from another source is how
much weight should be given to the two sources. In our case, the offline study serves as the
“external” data source, and the weight given to it is proportional to the variance/covariance
between the online and offline parameters.
We contribute to the field of quantitative preference measurement by proposing a method
to improve the validity of online conjoint studies in predicting offline purchase behavior. We are
the first to systematically investigate the effect of the medium on conjoint estimates. The vast
majority of conjoint studies are done on the computer with descriptions of hypothetical products’
attributes (e.g. Allenby, Arora and Ginter 1995; Ding 2007; Evgeniou, Pontil and Toubia 2007;
Jedidi and Zhang 2002; Lenk et al. 1996). While there have been some efforts at making online
conjoint studies more realistic (Dahan and Srinivasan 2000; Berneburg and Horst 2007), the
conjoint literature has not explicitly evaluated whether the typical task format—in which product
descriptions are shown to the consumer as attribute descriptions on the computer—is
representative of the way a respondent would behave if evaluating the physical product with the
same features.
The full description of the conjoint studies is presented next.
Experimental design
In order to systematically compare online and offline product evaluations, we conducted
two within-subject conjoint studies: one online, in which participants evaluate product
descriptions, and the other offline, with physical products. A firm that is considering launching a
new product or a new version of an existing one could use this framework at a prelaunch stage
with prototypes.
Product:
The choice of the “right” product is important since we wished a product that is configurable,
has discrete attributes, and is priced such that subjects would give their choice full attention.
Timbuk2 messenger bags were chosen for the following reasons:
(1) they vary on discrete features, some of which are “touch and feel” features for which we
might expect to see discrepancy between online and offline evaluations; (2) they are fully
configurable, which allowed us to purchase bags with the aim of creating a balanced orthogonal
design for the physical conjoint; (3) they are in the right price range, such that they are expensive
enough for participants to take the decision seriously, but cheap enough that undergraduate
students might be interested in purchasing them; (4) they are infrequently purchased, such that
we can expect that many participants would not be familiar with some of the attributes and not
have well-formed preferences; and finally (5) they are physically small enough for us to be able
to conduct the study in the behavioral lab.
Attributes:
Timbuk2’s website offers a full customization option that includes a number of features
(http://www.timbuk2.com/customizer). We selected a subset of attributes that we expected to be
relevant to the target population and for which there is likely to be some uncertainty on the part
of consumers and respondents. For example, we excluded the Right- or Left-Handed Strap
option since respondents would not have any uncertainty with respect to being left- or right-
handed. In addition, we combined the five color features into one Exterior Design feature that
has four options. To make the study manageable we reduced the number of levels of some of the
features. We therefore have the following six attributes for the study:
- Exterior design (4 options): Black, Blue, Reflective, Colorful
- Size (2 options): Small (10 x 19 x 14 in), Large (12 x 22 x 15 in)
- Price (4 levels): $120, $140, $160, $180
- Strap pad (2 options): Yes, No
- Water bottle pocket (2 options): Yes, No
- Interior compartments (3 options): Empty bucket with no dividers, Divider for files, Padded laptop compartment
Since we treat the price variable as continuous, the remaining five attributes have a total of 13
discrete attribute levels. Setting the default level of each dummy variable to zero (black color,
small size, no strap pad, no water bottle pocket, and empty bucket) leaves 10 parameters to be
estimated (8 discrete, one continuous, and a constant). Using the D-optimal
study design criterion (Kuhfeld, Tobias and Garratt 1994; Huber and Zwerina 1996), we selected
a 20-product design that has a D-efficiency of 0.97.
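The D-optimal criterion selects the design maximizing the determinant of the information matrix X′X. The sketch below illustrates one common normalization of D-efficiency, under which an orthogonal design scores 1; the exact scaling used by Kuhfeld, Tobias, and Garratt (1994) may differ, so treat this as an illustration rather than their procedure.

```python
import numpy as np

def d_efficiency(X):
    """Relative D-efficiency of a design matrix X (n runs x p parameters):
    det(X'X / n)^(1/p), which equals 1 for an orthogonal design whose
    columns are coded +1/-1 (effects coding)."""
    n, p = X.shape
    return np.linalg.det(X.T @ X / n) ** (1.0 / p)

# Toy 2^3 full factorial in +1/-1 effects coding: orthogonal by construction.
levels = np.array([[s1, s2, s3] for s1 in (1, -1)
                   for s2 in (1, -1) for s3 in (1, -1)], dtype=float)
X = np.hstack([np.ones((8, 1)), levels])  # intercept + three main effects
print(round(d_efficiency(X), 3))  # orthogonal design -> 1.0
```

A search procedure (for example, a Fedorov-style exchange algorithm) would compare candidate 20-product subsets by this criterion and keep the best one found.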
Participants:
We recruited 122 participants from a university subject pool where respondents signed up
for an individual time slot. Because one of the two studies involved looking at physical bags,
only one person participated at a time, to avoid participants influencing one another’s preferences.
Incentives:
To ensure incentive compatibility and promote honest responses, participants were told
by the experimenter that they would be entered in a raffle for a chance to win a free messenger
bag. Were they to win, their prize would be a bag that was configured to their preferences, which
the researchers would infer from the responses they provided in the study. This chance of
winning a bag provides incentive to participants to take the task seriously and respond truthfully
with respect to their preferences (Ding 2007, Toubia et al. 2012). We followed the instructions
used by Ding et al. (2011) and told participants that, were they to win, they would be given a
messenger bag plus cash, which together would be valued at $180. The cash component was
intended to eliminate any incentive for the participants to provide higher ratings for more
expensive items, in order to win a more expensive prize. Respondents were paid $7 to complete the
30-minute study, plus the chance to win the incentive-aligned prize discussed above. All 122
participants completed the study; that is, the completion rate was 100%, which is not unreasonable
for a lab study.
Conjoint task:
We used a ratings-based task in which respondents rated each bag on a 5-point scale
(Definitely not buy; Probably not buy; May or may not buy; Probably buy; Definitely buy). We
chose a ratings-based task rather than a choice-based task because the latter is logistically much
more complex with physical products, in a study that is already demanding for participants. Even
when conducting a conjoint study online, choice tasks take as much or more time than ratings
tasks (Huber, Ariely and Fischer 2002; Orme, Alpert and Christensen 1997), and produce less
information than individual product rating tasks (Moore 2004). Conducting choice tasks offline
would be even more time consuming as the experimenter would have to present the respondent
with a set of bags, ask the respondent to choose, then present another set of bags, and so on. For
a comprehensive comparison of ratings-based and choice-based conjoint analysis models, see
Moore (2004).
Online task:
The online task was conducted using Sawtooth Software. The first screens walked the
participants through the feature descriptions one by one. After that, respondents were shown a
practice rating question and were informed that it was for practice and that their response to the
question would be discarded. The following screens each presented a single product configuration,
along with the 5-point scale, and one additional question that was used for another study. An
example screen shot is shown in Figure 1a. Participants could go back to previous screens if they
wanted but could not skip a question. Lastly, participants were asked to rate each of the 13
features with respect to what degree they felt they would need to examine a product with this
feature to be able to evaluate it. This was measured on a sliding scale ranging from “Definitely do
not need to see” to “Definitely need to see”, corresponding to 0 and 100, respectively.
Figure 1a: Sample online conjoint screen shot
Figure 1b: Offline task room setup
Offline task:
The offline task was conducted in a room separate from the computer lab in which the
online task had been conducted to ensure that participants could not see the bags while
completing the online task. This task was done individually, one respondent at a time in the
room, so as to avoid a contagion effect. The bags were laid out on a conference table, each with a
card next to it displaying a corresponding number (indexing the item), and the bags were
arranged in the order 1 through 20 (see Figure 1b). The prices were displayed on stickers on
Timbuk2 price tags attached to each bag. The experimenter walked the respondents through all
the features, showing each one on a sample bag.
Model
In order to investigate whether participants’ preferences differ between the online and offline
formats, we allow the partworths to vary by respondent, feature, and format. We use the
following standard specification² for each individual’s rating of each product in each format:

(1)    r_ik^f = β_i0^f + Σ_j β_ij^f x_kj + ε_ik^f,

where r_ik^f is the rating provided by participant i to bag k in task format f (online or offline);
β_ij^f is the partworth assigned by participant i to feature j in task format f; β_i0^f is the
intercept; and ε_ik^f is a random error term.
Product k is captured by its J attribute levels x_kj, where all the attributes are coded as binary
dummy variables except for the continuous price variable. To capture consumer heterogeneity,
we fit a linear mixed effects (LME) model to the ratings data. That is, we assume that a
respondent’s individual partworths are drawn from a multivariate normal distribution:
2 For example, Green and Srinivasan 1990, Huber 1997, Huber, Ariely and Fischer 2002, Kalish and Nelson 1991.
β_i ~ N(μ, Σ),    where β_i = (β_i1^on, …, β_iJ^on, β_i1^off, …, β_iJ^off)′ stacks respondent i’s
online and offline partworths and μ stacks the corresponding population means.

To allow for heterogeneity among consumers, we have to estimate the elements of the main
diagonal of Σ, which correspond to Var(β_ij^f), capturing the population variance of the partworths
of each feature. Because a key construct in this paper is the correlation between a respondent’s
online and offline partworth for the same feature, we also estimate Cov(β_ij^on, β_ij^off) for all j.

Since the full matrix Σ is of order J² (400 in our case), and since we do not expect a correlation
among different features, we fix at zero the elements of Σ that correspond to Cov(β_ij^f, β_ij′^f′)
for j ≠ j′. Thus we assume that the covariance matrix is block diagonal (after reordering), with one
2 × 2 block per feature:

Σ_j = [ Var(β_ij^on)             Cov(β_ij^on, β_ij^off) ]
      [ Cov(β_ij^on, β_ij^off)   Var(β_ij^off)          ].
We estimate the LME in equation (1) using maximum likelihood, and use these estimates
for the remainder of the paper.³ The estimates of all features’ fixed effects (that is, the
population-average feature partworths) are reported in Table 1. The estimates of the population
3 Note that while choice-based conjoint traditionally requires more complex methods, such as MCMC, to estimate
the choice models, ratings tasks can be estimated using classical methods.
partworth variances, Var(β_ij^f), and the online–offline correlations, Corr(β_ij^on, β_ij^off), are
reported in Table 2.
Table 1: Mean population partworths

| Attribute | Level | Online partworth | Offline partworth | Difference |
| Exterior design | Reflective | -0.31** | -0.60** | -0.28* |
| | Colorful | -1.06** | -0.71** | 0.36** |
| | Blue | -0.22** | -0.11 | -0.12 |
| | Black (baseline) | | | |
| Size | Large | 0.27** | -0.31** | -0.58** |
| | Small (baseline) | | | |
| Price | $120, $140, $160, $180 | -0.22** | -0.15** | 0.06** |
| Strap pad | Yes | 0.51** | 0.25** | -0.26** |
| | No (baseline) | | | |
| Water bottle pocket | Yes | 0.45** | 0.17** | -0.28** |
| | No (baseline) | | | |
| Interior compartments | Divider for files | 0.41** | 0.52** | 0.11 |
| | Padded laptop compartment | 0.62** | 0.88** | 0.26** |
| | Empty bucket/no dividers (baseline) | | | |
| Intercept | | 3.72** | 3.39** | -0.33 |
**p
Because the partworth of one level of each attribute is normalized to zero, these values
always represent a comparison to the default level. For example, the negative values of the
exterior designs signify that, at the population level, Black is the preferred design.
To appreciate the magnitude of these differences, we calculated the willingness to pay for
the attributes using the methodology of Ofek and Srinivasan (2002). The resultant median
willingness to pay for Strap pad is $43 online and $31 offline; for Water bottle pocket, the WTP
is $40 online and exactly half ($20) offline. These represent considerable differences if the firm
is to base its pricing on these findings.
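For a linear utility with a negative price partworth, an attribute’s WTP is its partworth divided by the utility cost of a dollar. The sketch below applies this ratio to the population-mean partworths in Table 1, assuming the price partworth is scaled per $20 price step (an assumption about the coding). Ofek and Srinivasan (2002) compute WTP at the individual level and report medians, so the numbers here differ somewhat from the $43/$31 reported above.

```python
def wtp(beta_feature, beta_price, price_step=20.0):
    """Willingness to pay for an attribute level: the utility gain from the
    feature divided by the (absolute) utility cost of one price step,
    converted to dollars."""
    return beta_feature / abs(beta_price) * price_step

# Strap pad, using the population-mean partworths from Table 1:
print(round(wtp(0.51, -0.22)))  # online:  ~46 dollars
print(round(wtp(0.25, -0.15)))  # offline: ~33 dollars
```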
Large population standard deviations signify a great deal of heterogeneity among the
respondents in their preference for the attribute. For example, we can see that there is large
variation in respondents’ preference for Colorful, while there is a relative consensus on Strap
Pad. Also note that the preferences for Reflective and Colorful are more heterogeneous offline
than online. The value of the correlation is a measure of how systematic the bias is. If the
correlation is high, it suggests that if there is an online/offline discrepancy, it is systematic across
respondents. In the extreme case, every respondent’s online partworth estimate would differ from
its offline counterpart by a constant. If the correlation is low, then a respondent’s online
partworth is not a good predictor of her offline partworth.
Our first main result is that the population-level estimates of most features differ by task
format, and some of the differences are large, suggesting a systematic bias that is being introduced by using online
preference elicitation. This is a major issue if the aim is to make predictions in the offline
environment based on online market research. Both aggregate-level predictions such as market
shares and individual predictions such as segmentation or targeting that are based on online
preference elicitation would be incorrect.
Several attribute partworth changes are worth noting:
- The single attribute that did not change significantly from the online to the offline
scenario is the color Blue. This is likely because the Blue color can be accurately
evaluated based on the image provided in the online task, and Color is a very salient
attribute in both conditions.
- The decrease of the partworths of Water bottle pocket and Strap pad is substantial.
This may be attributed to those attributes being made more salient in the online
condition, when they are stated verbally; offline, on the other hand, they may be
overlooked by participants altogether.
- Only one attribute, Size, changed sign. Thus online respondents preferred the larger
bag but changed their preference to the smaller one once they physically examined
the bags.
- The fact that the intercept’s value does not change implies that there is no feature that
changes upon physical examination that is common to all the bags, such as the
material used.
Since our first main result indicates a substantial online discrepancy in the estimates of the
parameters of interest, we next propose a method to correct for this bias to improve predictions
about consumer offline purchase behavior.
Improving predictions of offline purchase behavior
When conducting marketing research with the purpose of product design or market share
prediction, firms typically conduct online conjoint studies with large representative samples of
participants, but with the intent of predicting purchase behavior that will take place offline. Our
results in the previous section provide strong evidence of a discrepancy in partworths
measurement between online and offline evaluations. Consequently, if the product is sold
primarily in brick-and-mortar stores, an online-only conjoint study is not sufficient, and an offline
conjoint task is required to obtain more accurate predictions. However, conducting large offline
conjoint studies is costly because it involves evaluation of physical products as opposed to online
descriptions. There are a few exceptions, such as Luo, Kannan, and Ratchford (2008) and She and
Macdonald (2013).
We propose to address the above challenge by (a) supplementing a large online conjoint
study with a sample of respondents who complete both the online and offline tasks and (b)
designing a correction that we term the Inter-task Conditional Likelihood correction (ICL),
which uses the supplemented data to improve predictions of the respondents’ offline purchase
behavior. We trade off the accuracy of our predictions against the cost of data collection by asking
a small number of the respondents from a large conjoint task to complete both the online and
offline conjoint tasks; the number of respondents chosen determines this trade-off.
The structure of the resulting data set is illustrated in Figure 2, where the shaded area
corresponds to the observed data and the un-shaded area corresponds to the missing data that are
of interest.
Figure 2: Data split into estimation and prediction. The training sample (50%) contributes both
online and offline data, used for estimation; for the hold-out sample (50%), only online data are
observed, and the offline ratings are the prediction task.
We use the data from the respondents who completed both the online and offline conjoint
tasks to infer the correlations between the online and offline partworths. We then use these
correlations to infer the missing offline ratings for the other respondents. In order to carry out the
inference, we design the ICL correction, based on maximizing the conditional likelihood of the
task of interest (offline), conditioned on data collected from a different task (online). It is
designed to exploit the correlations that exist between the online and offline partworths, and
therefore, its success depends on the extent of the correlation. We illustrate the ICL method on
the data from the conjoint task described above. Our results demonstrate that the ICL method
works well for (1) the prediction of individual-level offline ratings and (2) the accurate
estimation of the population preference distribution. A key result from our experiments is that
collecting offline and online data for a small number of respondents can result in substantial
improvements in the prediction accuracies.
(1) Individual-level offline ratings prediction
We illustrate our method in the context of the Timbuk2 conjoint study described above.
Because we have collected within-subject data on both the online and offline ratings, we are able
to hold out the offline ratings for a subset of respondents (validation set) and to predict them by
applying the ICL to their online data. To that end, we hold out the offline ratings for a subset of
our sample (the hold-out set), and predict them using only these respondents’
online data. We use the remainder of the sample (the training set) to train our model of online and
offline partworths, detailed below, which results in an individual-level estimate of the offline
partworth, β̂_ij^off, for respondent i and attribute j. We demonstrate that the proposed ICL
correction significantly improves out-of-sample prediction.
We consider two prediction tasks: (a) offline ratings for the 20 bags and (b) individual-
level offline partworths. Depending on the application, one or the other prediction task may be
more relevant. For instance, if the goal were to predict market shares, then the ability to
accurately predict offline ratings is more valuable, whereas if the goal were to segment the
market based on consumer preferences, then accurately predicting partworths is of greater value.
We quantify the prediction gains from our techniques in terms of the popular root mean square
error (RMSE) metric. We report the RMSE metrics for the two prediction tasks: RMSE_rating for
the offline ratings prediction task and RMSE_partworth for the partworths prediction task. The
metrics are defined as follows:
RMSE_rating = (1 / |S_val|) Σ_{i ∈ S_val} √[ (1/K) Σ_k ( r_ik^off − r̂_ik^off )² ]

RMSE_partworth = (1 / |S_val|) Σ_{i ∈ S_val} √[ (1/J) Σ_j ( β_ij^off − β̂_ij^off )² ]

where S_val is the set of respondents used for validation, K is the number of bags, J is the number
of attribute parameters, and r̂_ik^off is the predicted offline rating of respondent i for bag k
computed according to the individual-level estimate of the offline partworth, β̂_ij^off, for
respondent i and attribute j.
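The ratings metric can be sketched in code: each validation respondent’s root-mean-square rating error over bags is computed first, then averaged over the validation set.

```python
import numpy as np

def rmse_rating(r_true, r_pred):
    """RMSE_rating: root-mean-square rating error per respondent (over bags),
    averaged over the validation set. Inputs: (n_respondents, n_bags) arrays."""
    per_respondent = np.sqrt(np.mean((r_true - r_pred) ** 2, axis=1))
    return per_respondent.mean()

# Toy check: one respondent off by exactly 1 on every bag, one predicted
# perfectly; the metric averages their per-respondent errors.
r_true = np.array([[3., 4., 5.], [2., 2., 2.]])
r_pred = np.array([[4., 5., 6.], [2., 2., 2.]])
print(rmse_rating(r_true, r_pred))  # (1.0 + 0.0) / 2 = 0.5
```

RMSE_partworth is computed identically, with partworth vectors in place of rating vectors.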
Table 3 compares the performance of the ICL method against three different benchmarks,
described below. The reported metrics were averaged over 50 random partitions of the data set
into 50% training and 50% validation sets to remove any data partitioning effects. We first
describe the three different benchmark methods and then describe the ICL method in detail.
Table 3: Out-of-sample predictive performance (lower is better)

| | ICL | Online data | Partial offline data | Full offline data |
| RMSE_rating | 1.04 | 1.56 | 1.18 | 0.51 |
| RMSE_partworth | 0.07 | 0.21 | 0.24 | 0.00 |
Full offline data: The first benchmark method provides a lower bound (the best achievable
performance) on the RMSE metrics. For each individual i in the validation set, we set β̂_ij^off
equal to the ordinary least squares (OLS) estimate obtained from that respondent’s offline ratings
data, and r̂_ik^off = β̂_i0^off + Σ_j β̂_ij^off x_kj. By construction, RMSE_partworth = 0, and the
resulting RMSE_rating is the lowest we can get with a linear model. This method uses data that
are not part of our training sample, but it provides the best performance we can hope to achieve.
Table 3 reports RMSE_rating = 0.51 for this method.
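The per-respondent OLS fit underlying this benchmark can be sketched as follows, with a small synthetic design matrix standing in for the study’s actual 20-bag design:

```python
import numpy as np

# Columns: intercept, two attribute dummies, price. Ratings are generated
# noiselessly from known partworths so the OLS recovery is exact.
X = np.array([[1., 0., 0., 120.],
              [1., 1., 0., 140.],
              [1., 0., 1., 160.],
              [1., 1., 1., 180.],
              [1., 1., 0., 120.],
              [1., 0., 1., 140.]])
beta_true = np.array([3.5, 0.4, 0.6, -0.01])  # hypothetical offline partworths
r_off = X @ beta_true                          # this respondent's offline ratings

# OLS estimate of the respondent's partworths from her offline ratings.
beta_ols, *_ = np.linalg.lstsq(X, r_off, rcond=None)
print(np.allclose(beta_ols, beta_true))  # True: exact recovery without noise
```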
Online-only data: This method captures the current practice: regardless of the channel in which
the product is to be sold (offline or online), the research is conducted online. For each
respondent, we predict her offline bag ratings using the partworths estimated from her online
conjoint task; i.e., we use β̂_ij^off = β̂_ij^on. This results in predicted offline ratings r̂_ik^off
computed according to the online estimated partworths:

r̂_ik^off = β̂_i0^on + Σ_j β̂_ij^on x_kj
Because this method corresponds to the way conjoint studies are typically conducted, it is a
reasonable baseline model. We note from Table 3 that the RMSE metrics of this method are
RMSE_rating = 1.56 and RMSE_partworth = 0.21, significantly higher than the lower bound
from the full offline data. Further, it is clear from the results in Table 3 that we can get
significant improvements (33% and 67% for ratings and partworths predictions, respectively)
over the current practice by incorporating offline data through the ICL method. These
improvements quantify the benefits of our method over the current practice.
Partial offline data: For this method, we used only the offline ratings in the training data.
Specifically, we trained a linear mixed effects model as described in the section above but only
on the offline ratings collected from the participants in the training data. We then used the
estimated population-level parameters as our estimates for the individual-level offline
partworths; i.e., β̂_ij^off = μ̂_j^off, the population-mean offline partworth. These partworths
were then used to predict the offline ratings for the participants in the test data set as follows:

r̂_ik^off = μ̂_0^off + Σ_j μ̂_j^off x_kj
Note that because we are not using the online observations of the participants, our partworth and ratings predictions for each product are the same across all the participants. We note from Table 3 that the RMSE metrics are 1.18 for ratings and 0.24 for partworths.
We observe that the RMSE for the ratings prediction is about 24% lower than that of the current practice (the online only data benchmark). This finding is surprising because it suggests that predicting offline ratings using population-level parameters and no individual-level information can, on average, be more accurate than actually asking the participants for their online ratings! This result reveals the level of discrepancy that exists
between the online and offline partworths in the messenger bag setting. Further, it suggests that the firm may in fact obtain a more accurate understanding of customers' offline purchase behavior from a small offline conjoint than from only a large online conjoint.
We also note that despite outperforming the current practice, the partial offline method still yields higher RMSE metrics than the ICL correction method (by about 12% for ratings and 71% for partworths predictions), suggesting that a combination of online and offline conjoint data can outperform either data source alone.
ICL correction: The ICL method computes the expected offline ratings conditioned on all the observed data. We exploit the properties of multivariate normal distributions in order to compute the conditional expectations in closed form. Specifically, recall that we assume that each respondent $i$ samples the online and offline partworths of each attribute $j$, $\beta_{ij}^{\text{on}}$ and $\beta_{ij}^{\text{off}}$, jointly from a bivariate normal distribution:
$$\begin{pmatrix} \beta_{ij}^{\text{on}} \\ \beta_{ij}^{\text{off}} \end{pmatrix} \sim N\left( \begin{bmatrix} \mu_j^{\text{on}} \\ \mu_j^{\text{off}} \end{bmatrix},\; \begin{bmatrix} (\sigma_j^{\text{on}})^2 & \operatorname{cov}(\beta_j^{\text{on}}, \beta_j^{\text{off}}) \\ \operatorname{cov}(\beta_j^{\text{on}}, \beta_j^{\text{off}}) & (\sigma_j^{\text{off}})^2 \end{bmatrix} \right),$$
where $\operatorname{cov}(\beta_j^{\text{on}}, \beta_j^{\text{off}})$ is the covariance between the online and offline partworths of attribute $j$.
We use the observed data to determine the maximum likelihood estimates of the population-level parameters $\mu_j^{\text{on}}$, $\mu_j^{\text{off}}$, $\sigma_j^{\text{on}}$, $\sigma_j^{\text{off}}$, and $\operatorname{cov}(\beta_j^{\text{on}}, \beta_j^{\text{off}})$ for each attribute $j$. Note that the data from the group of respondents who completed both the online and offline tasks allow us to infer the covariance parameters. Given the population-level parameters, we can show that the conditional distribution of $\beta_{ij}^{\text{off}}$ given $\beta_{ij}^{\text{on}}$ is a normal distribution, with mean $\mu_{j,\text{off}|\text{on}}$ and standard deviation $\sigma_{j,\text{off}|\text{on}}$ that are given by
$$\mu_{j,\text{off}|\text{on}} = \mu_j^{\text{off}} + \rho_j \frac{\sigma_j^{\text{off}}}{\sigma_j^{\text{on}}}\left(\beta_{ij}^{\text{on}} - \mu_j^{\text{on}}\right),$$
$$\sigma_{j,\text{off}|\text{on}} = \sigma_j^{\text{off}} \sqrt{1 - \rho_j^2},$$
where $\rho_j$ is the correlation between the online and offline partworths of attribute $j$.
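For the respondents observed in both channels, the maximum likelihood estimates of these population-level parameters reduce to sample means, standard deviations, and the sample correlation. A sketch with toy paired partworths (our own illustration, not the study's estimates):

```python
import numpy as np

def estimate_population_params(beta_on, beta_off):
    """MLE of the bivariate-normal population parameters for one attribute,
    from paired online/offline partworths across respondents."""
    mu_on, mu_off = beta_on.mean(), beta_off.mean()
    sd_on, sd_off = beta_on.std(), beta_off.std()  # ddof=0 matches the MLE
    rho = np.corrcoef(beta_on, beta_off)[0, 1]
    return mu_on, mu_off, sd_on, sd_off, rho

# Toy paired partworths of one attribute for five respondents:
beta_on = np.array([0.3, 0.5, 0.7, 0.4, 0.6])
beta_off = np.array([0.1, 0.2, 0.4, 0.1, 0.3])
mu_on, mu_off, sd_on, sd_off, rho = estimate_population_params(beta_on, beta_off)
print(mu_on, mu_off, round(rho, 3))
```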
Note that $\mu_{j,\text{off}|\text{on}}$ is also the maximum likelihood estimator of $\beta_{ij}^{\text{off}}$, because of normality. Therefore, under this model, the maximum likelihood estimates of a respondent's offline partworths are given by:
$$\hat{\beta}_{ij}^{\text{off}} = \mu_j^{\text{off}} + \rho_j \frac{\sigma_j^{\text{off}}}{\sigma_j^{\text{on}}}\left(\beta_{ij}^{\text{on}} - \mu_j^{\text{on}}\right).$$
Because we have a closed-form expression for the correction, computing it is straightforward once the population-level parameters have been estimated. Using the ICL correction on our data, we obtain a ratings RMSE of 1.04 and a partworths RMSE of 0.07. These are substantial improvements over the uncorrected methods.
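A minimal sketch of the closed-form correction (our own illustration; the parameter values below are made up, not estimates from our data):

```python
import math

def icl_correct(beta_on, mu_on, mu_off, sd_on, sd_off, rho):
    """Conditional mean of the offline partworth given the online one under the
    bivariate normal model; by normality, this is also the respondent-level MLE."""
    return mu_off + rho * (sd_off / sd_on) * (beta_on - mu_on)

def conditional_sd(sd_off, rho):
    """Standard deviation of the offline partworth conditional on the online one."""
    return sd_off * math.sqrt(1.0 - rho ** 2)

# Made-up numbers: a respondent whose online partworth sits above the online mean.
est = icl_correct(beta_on=0.6, mu_on=0.45, mu_off=0.17, sd_on=0.26, sd_off=0.22, rho=0.5)
print(round(est, 3))  # the online deviation is shrunk toward the offline mean
```

The stronger the online/offline correlation $\rho_j$, the more of a respondent's online deviation from the mean carries over into the offline estimate; at $\rho_j = 0$ the correction collapses to the population mean, which is exactly the partial offline benchmark.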
(2) Estimation of population-level parameters
We now investigate the ability of our proposed method to obtain accurate estimates of
population-level parameters. As above, we compare the performance of the ICL method to the
three different benchmarks. Population-level partworth parameters allow us to gain an intuitive
understanding of the population’s preference for a certain attribute (e.g., Allenby et al. 2014).
We assume that individual partworth vectors are drawn from a multivariate normal distribution,
as described above. Under this assumption, we determine the maximum likelihood estimates of
the population mean and variance of the partworth distribution for each attribute for each of the
three different benchmark methods and the ICL correction method. We quantify the difference in
their performances using the Kullback-Leibler (KL) divergence metric (Kullback and Leibler
1951). Specifically, using the full offline data, we compute the ground-truth estimates of the
mean and variance parameters. We then compute the KL divergence between the distributions
estimated using each of the methods and the ground-truth distributions obtained from the full
offline data. The KL-divergence metrics are reported in Table 4. As is clear from the table, the estimates resulting from the ICL are close to the ground-truth full offline data estimates, consistently across all the attributes, lending further support to our proposal of supplementing online data with offline data. In particular, the estimates obtained by ICL significantly outperform those obtained from online only data on all attributes except for the color Blue.
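Since each attribute's partworth distribution is univariate normal here, the KL divergence has a closed form; a short sketch (our own illustration, with arbitrary parameter values):

```python
import math

def kl_normal(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for univariate normals P = N(mu_p, var_p), Q = N(mu_q, var_q)."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

# Identical distributions have zero divergence; shifting the mean by one
# standard deviation (with unit variances) yields a divergence of 0.5.
print(kl_normal(0.25, 0.15, 0.25, 0.15))  # 0.0
print(kl_normal(0.0, 1.0, 1.0, 1.0))      # 0.5
```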
Since the offline conjoint task is challenging and costly to conduct, we explore whether it
can be avoided altogether. Specifically, consumers might have intuitions about which attribute
preferences have a greater possibility of changing from the online to the offline environment. We
tackle this issue next.
Table 4: Population-level parameters – offline, corrected, and online estimates

                                             Offline          ICL corrected                   Online only
Attribute              Level                 Mean    Var      Mean    Var    KL(off,ICL)     Mean    Var    KL(off,online)
Exterior design        Reflective            -0.60   0.70     -0.61   0.72   0.010           -0.31   0.29   0.146
                       Colorful              -0.71   0.91     -0.70   1.03   0.007           -1.06   0.88   0.073
                       Blue                  -0.11   0.25     -0.11   0.16   0.043           -0.22   0.16   0.042
                       Black
Size                   Large                 -0.31   0.28     -0.30   0.31   0.022            0.27   0.18   1.021
                       Small
Price                  $120-$180             -0.15   0.00     -0.16   0.00   0.081           -0.22   0.00   0.421
Strap pad              Yes                    0.25   0.15      0.27   0.14   0.011            0.51   0.11   0.212
                       No
Water bottle pocket    Yes                    0.17   0.05      0.15   0.05   0.068            0.45   0.07   0.619
                       No
Interior compartments  Divider for files      0.52   0.04      0.48   0.05   0.065            0.41   0.06   0.149
                       Crater laptop sleeve   0.88   0.20      0.83   0.31   0.018            0.62   0.25   0.143
                       Empty bucket
Intercept                                     3.55             3.44                           3.72
Stated Uncertainty
It is possible that consumers are aware that they cannot judge the value of certain attributes with accuracy. If the online/offline discrepancy occurs to people many times across different categories, consumers may have learned to anticipate changing personal preferences. In that case, we can improve our decision making by asking consumers to self-state the need to examine each attribute physically in order to accurately judge it. Note that this consumer belief uncertainty is different from the magnitude of the variance around the parameter estimate, which represents the researcher's uncertainty.
In the online task, after rating all the bags, participants were asked to state their certainty
about how well they could judge each feature from the online description. The exact wording of
the question was: “Some of the bag features may be clear to you simply from the description
provided online. Other features you may want to physically examine before making your final
decision. Please rate the following features on how useful it would be for you to examine a
product with this feature in a store.” Each feature was then listed, with a sliding scale ranging
from “Definitely don’t need to see in store” to “Definitely need to see in store”, which
corresponded to uncertainty ratings of 0 and 100, respectively. Price was excluded. While these
scales are not an objective measure of variance and the quantities reported should not be
interpreted in isolation, their relative values are meaningful. Comparing stated uncertainty with
the (absolute) difference between online and offline partworths, we find that stated uncertainty is
a rather poor predictor of the changes that occur. Population averages for the stated uncertainties
are given in Table 5, along with the corresponding features’ online-offline partworth differences.
Note that uncertainty was measured for all attribute levels, while partworths were normalized to
zero for one of the attribute levels.
It is clear that while participants can anticipate some of the attributes that will change, such
as Size and the Laptop sleeve, they miss others, such as Water bottle pocket, Strap pad, and
Colorful. Indeed, the correlation of the stated uncertainty and the absolute value of the difference
(for the features for which we have the difference) is rather low (36.8%). Moreover, this
correlation is driven almost exclusively by size: if we compute the correlation across all attributes excluding size, it disappears.
Table 5: Stated uncertainty and online/offline discrepancies

Attribute              Level                     Pre-evaluation uncertainty   Difference in partworths
Exterior design        Reflective                49.6                         -0.30
                       Colorful                  49.5                          0.36
                       Blue                      42.4                          0.08
                       Black                     43.5
Size                   Large                     75.7                         -0.54
                       Small                     75.4
Strap pad              Yes                       58.5                         -0.29
                       No                        30.3
Water bottle pocket    Yes                       45.6                         -0.29
                       No                        24.4
Interior compartments  Divider for files         63.0                          0.14
                       Crater laptop sleeve      70.2                          0.25
                       Empty bucket/no dividers  30.9
Conclusions and Implications
In this work we challenged the implicit assumption commonly made in market research
that findings collected from online research can be used to accurately predict offline behavior.
Consumers’ product evaluations from an online conjoint study with verbal product descriptions
as well as pictures were compared to an offline study with physical products. We found that the
vast majority of partworth parameters changed significantly from online to offline studies. To
correct for this disparity, we offered a method based on maximizing the likelihood of the offline
task, conditional on data collected from the online task. We showed that this estimator leads to
better out-of-sample prediction than using uncorrected online data.
In this paper we used primary data in order to carefully control for all factors and zero in on
the online/offline distinction. But the higher-level problem of predicting a consumer’s offline
preferences, given the same consumer’s online preferences, and other consumers’ online and
offline preferences has implications beyond online preference elicitation. Consider an online
retailer, such as Warby Parker, Zappos, or Bonobos. When consumers purchase from these
retailers, they decide what to order based on their online evaluation of the available items.
However, once they receive their order, they determine what they want to keep based on physical
evaluation. These and other online retailers typically have a very generous returns policy, so that
customers may try on several items before purchasing one. Warby Parker even offers a free
“Home Try-On” program in which customers may order several eyeglass frames to try at home,
return all, and then order the prescription lens to go with the chosen frames. Thus, the firm has
some data on both online and offline preferences for customers who have a history with Warby
Parker. When a potential new customer (who has not yet evaluated the firm’s products
physically) orders some items, the firm knows only the online behavior. In a sense the retailer
has more information than the single consumer. In addition to this website visitor’s online-
preference data, the retailer can use for estimation all the information gathered from current
customers, including online and offline preferences, to obtain an estimate on the offline
preference of the new customer. Note that the data scheme given in Figure 3 is similar in nature
to the one used in the data split in Figure 2, since the data used for estimation and prediction are
broken along the exact same lines.
Figure 3: Schematic data available to a typical online retailer
[Figure: a two-by-two layout whose columns distinguish existing customers from new customers, and whose rows distinguish the online data used for estimation from the offline prediction task.]
In the area of recommendation systems, a common problem structure is that substantial amounts of data are available on some customers while only very sparse data are available on others, and the former are used to improve predictions about the latter. In our case, the missing piece is the offline product evaluation. As we have demonstrated, relying only on a customer's online preferences to make predictions about his or her offline preferences may be unreliable. However, having access to both types of preferences for the existing set of customers enables the retailer to make a better prediction. An online retailer, through programs such as Warby Parker's Home Try-On program, may implement an offline recommendation system by including suggested items for a user who is ordering online.
References
Allenby, Greg M., Neeraj Arora, and James L. Ginter. 1995. Incorporating prior knowledge into
the analysis of conjoint studies. Journal of Marketing Research 35 (3) 152-162.
Allenby, Greg M., Jeff D. Brazell, John R. Howell, and Peter E. Rossi. 2014. Economic
valuation of product features. Quantitative Marketing and Economics 12 (4) 421-456.
Berneburg, Alma and Bruno Horst. 2007. 3D vs. traditional conjoint analysis: Which stimulus
performs best? Marketing Theory and Application, AMA Winter Educators’ Conference,
Vol. 18, Chicago, IL 112-120.
Bradlow, Eric T. 2005. Current issues and a ‘wish list’ for conjoint analysis. Applied Stochastic
Models in Business and Industry 21 (4‐5) 319-323.
Dahan, Ely and V. Srinivasan. 2000. The predictive power of Internet-based product concept
testing using visual depiction and animation. Journal of Product Innovation Management
17 (2) 99-109.
Ding, Min. 2007. An incentive-aligned mechanism for conjoint analysis. Journal of Marketing
Research 44 (2) 214-223.
Ding, Min, John R. Hauser, Songting Dong, Daria Dzyabura, Zhilin Yang, Chenting Su, and
Steven P. Gaskin. 2011. Unstructured direct elicitation of decision rules. Journal of
Marketing Research 48 (1) 116-127.
Dzyabura, Daria, and Srikanth Jagabathula. 2015. Offline assortment optimization in the
presence of an online channel. Working paper, New York University.
Dzyabura, Daria, and John R. Hauser. 2011. Active machine learning for consideration
heuristics. Marketing Science 30 (5) 801-819.
ESOMAR. 2014. Global Market Research Industry Report. ESOMAR, Amsterdam, The
Netherlands.
Evgeniou, Theodoros, Massimiliano Pontil, and Olivier Toubia. 2007. A convex optimization
approach to modeling consumer heterogeneity in conjoint estimation. Marketing Science 26
(6) 805-818.
Feit, Eleanor M., Mark A. Beltramo, and Fred M. Feinberg. 2010. Reality check: Combining choice experiments with market data to estimate the importance of product attributes. Management Science 56 (5) 785-800.
Gilbride, Timothy J., and Greg M. Allenby. 2004. A choice model with conjunctive, disjunctive,
and compensatory screening rules. Marketing Science 23 (3) 391-406.
Gilbride, Timothy J., Peter J. Lenk, and Jeff D. Brazell. 2008. Market share constraints and the
loss function in choice-based conjoint analysis. Marketing Science 27 (6) 995-1011.
Green, Paul E., and Venkat Srinivasan. 1990. Conjoint analysis in marketing: new developments
with implications for research and practice. The Journal of Marketing 54 (4) 3-19.
Hauser, John R., Olivier Toubia, Theodoros Evgeniou, Rene Befurt, and Daria Dzyabura. 2010.
Disjunctions of conjunctions, cognitive simplicity, and consideration sets. Journal of
Marketing Research 47 (3) 485-496.
Huber, Joel. 1997. What we have learned from 20 years of conjoint research: When to use self-
explicated, graded pairs, full profiles or choice experiments. Sawtooth Software Research
Paper Series.
Huber, Joel, Dan Ariely, and Gregory Fischer. 2002. Expressing preferences in a principal-agent
task: A comparison of choice, rating, and matching. Organizational Behavior and Human
Decision Processes 87 (1) 66-90.
Huber, Joel, and Klaus Zwerina. 1996. The importance of utility balance in efficient choice
designs. Journal of Marketing Research 33 (3) 307-317.
Jedidi, Kamel, and Z. John Zhang. 2002. Augmenting conjoint analysis to estimate consumer
reservation price. Management Science 48 (10) 1350-1368.
Kalish, Shlomo and Paul Nelson. 1991. A comparison of ranking, rating and reservation price
measurement in conjoint analysis. Marketing Letters 2 (4) 327-335.
Kuhfeld, Warren F., Randall D. Tobias, and Mark Garratt. 1994. Efficient Experimental Design
with Marketing Research Applications. Journal of Marketing Research 31 (4) 545-557.
Kullback, Solomon, and Richard A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22 (1) 79-86.
Lenk, Peter J., Wayne S. DeSarbo, Paul E. Green, and Martin R. Young. 1996. Hierarchical
Bayes conjoint analysis: recovery of partworths heterogeneity from reduced experimental
designs. Marketing Science 15 (2) 173-191.
Luo, Lan, P. K. Kannan, and Brian T. Ratchford. 2008. Incorporating subjective characteristics
in product design and evaluations. Journal of Marketing Research 45 (2) 182-194.
Marshall, Pablo, and Eric T. Bradlow. 2002. A unified approach to conjoint analysis
models. Journal of the American Statistical Association 97 (459) 674-682.
Moore, William L. 2004. A cross-validity comparison of rating-based and choice-based conjoint
analysis models. International Journal of Research in Marketing 21 (3) 299-312.
Neslin, Scott A., and Venkatesh Shankar. 2009. Key issues in multichannel customer
management: current knowledge and future directions. Journal of Interactive Marketing 23
(1) 70-81.
Netzer, Oded, Olivier Toubia, Eric T. Bradlow, Ely Dahan, Theodoros Evgeniou, Fred M.
Feinberg, and Eleanor M. Feit. 2008. Beyond conjoint analysis: Advances in preference
measurement. Marketing Letters 19 (3-4) 337-354.
Ofek, Elie and V. Srinivasan. 2002. How much does the market value an improvement in a
product attribute? Marketing Science 21 (4) 398-411.
Orme, Bryan K., Mark I. Alpert, and Ethan Christensen. 1997. Assessing the validity of conjoint analysis – continued. Sawtooth Software Conference Proceedings.
Sándor, Zsolt, and Michel Wedel. 2001. Designing conjoint choice experiments using managers’
prior beliefs. Journal of Marketing Research 38 (4) 430-444.
Sándor, Zsolt, and Michel Wedel. 2005. Heterogeneous conjoint choice designs. Journal of
Marketing Research 42 (2) 210-218.
Sehgal, Vikram. 2014. Forrester Research Online Retail Forecast, 2013 to 2018 (US). Forrester
Research. https://www.forrester.com/Forrester+Research+Online+Retail+Forecast+2013+To+2018+US/fulltext/-/E-RES115941
She, Jinjuan and Erin F. MacDonald. 2013. Trigger features on prototypes increase preference
for sustainability. ASME 2013 International Design Engineering Technical Conferences
and Computers and Information in Engineering Conference. American Society of
Mechanical Engineers, V005T06A043-V005T06A054.
Toubia, Olivier, Martijn G. de Jong, Daniel Stieger, and Johan Füller. 2012. Measuring consumer preferences using conjoint poker. Marketing Science 31 (1) 138-156.
Toubia, Olivier, John R. Hauser, and Duncan I. Simester. 2004. Polyhedral methods for adaptive
choice-based conjoint analysis. Journal of Marketing Research 41 (1) 116-131.
Verhoef, Peter C., Scott A. Neslin, and Björn Vroomen. 2007. Multichannel customer
management: Understanding the research-shopper phenomenon. International Journal of
Research in Marketing 24 (2) 129-148.
Yang, Liu, Olivier Toubia and Martijn G. de Jong. 2015. A bounded rationality model of
information search and choice in preference measurement. Journal of Marketing Research
52 (2) 166-183.
Yee, Michael, Ely Dahan, John R. Hauser, and James Orlin. 2007. Greedoid-based
noncompensatory inference. Marketing Science 26 (4) 532-549.