Product and Package Testing




    Howard R. Moskowitz, Ph.D.

Moskowitz Jacobs Inc., 1025 Westchester Avenue

    White Plains, New York 10604

The author wishes to acknowledge the assistance of Margaret Mirabile in the preparation of this article for publication.

    Part 1 Product Testing Overview

    Objectives of Product Testing

In the world of market research, product testing occupies a venerable place. Manufacturers need to know how their products perform. In the main, product testing methods have been developed, advanced, and used by researchers involved in fast moving consumer goods (e.g., beverages, shaving creams, etc.). The objectives of the research have varied from measuring basic acceptance (is the product good or bad, accept Vs reject), to providing consumer feedback for ongoing formulation, to concept-product fit, and even to estimating market share or expected volume. The many uses of product testing mean that there is no one discipline or set of knowledge which encompasses the world of evaluation that we call product testing. The stage at which the test is commissioned, the knowledge base of the researcher doing the test, and the use of the data all dictate different types of tests.

The Different "Players" In the Product Testing World

Market researchers are only one group of professionals involved in the evaluation of products. When looking at the professionals involved and the technical literature cited, the reader might come across professionals as different as product developers, sensory analysts (R&D oriented), quality control specialists (production oriented), consultants to management (process oriented, for the development process), market researchers, advertising agencies, statisticians, and many more. For health and beauty-aid products, one might also cross knives with perfumers. In durables, one might also meet up with engineers (Gacula 1993; Meilgaard, Civille and Carr 1987; Moskowitz 1983; 1984; 1985; 1994; Stone and Sidel 1985). Each of the players in the product-testing world tests for a specific reason. As we will see below, the different players in turn bring to bear their predilections and biases (about what data should be, what respondents should do, and even language).

What Type of Data Does The Researcher Present? The Difference Between Early Stage Guidance And Late Stage Confirmation

Most researchers in the market research industry are familiar with late-stage product testing. In late-stage testing, most of the developmental work has finished. R&D has probably finished creating its prototypes, tested these among consumers (more about that below), and arrived at one or two potentially acceptable prototypes. Criteria for selecting these prototypes for a confirmatory test may include success in earlier stage testing. The products may be promising because they are not distinguishable from the current gold standard (e.g., in the case of cost reduction, where the goal is to maintain product quality with cheaper ingredients), or because the hope is that the product fits a marketing driven concept.

Late stage product testing usually does not call for action beyond a pass/fail (although fail may call for additional work). Rather, the late stage product test provides the necessary information to confirm that R&D has developed an appropriate product. In common parlance, late stage (confirmatory) product testing acts as a disaster check (although it is unclear how a potentially disastrous product, with low acceptance, could have even reached this stage of testing).

In contrast to late stage confirmatory tests are early stage development studies. Variously called research guidance tests, R&D hall tests, and the like, these early stage tests serve the purpose of guiding R&D (research and development). Typically these tests are commissioned in order to screen winning products from among a wide variety of contenders (for future work), to identify the characteristics of a product that drive acceptance (for brand learning), to segment consumers on the basis of sensory preference (for targeted development / marketing), etc. Early stage tests may be run by R&D, but often are run by market researchers as well. Early stage tests encompass a far wider range of questions and applications than do late stage confirmatory tests. Typically, these early stage tests provide a plethora of information that could not otherwise be obtained.

    Types of Problems to Be Answered

The key to correct (or better, appropriate) product tests is to identify a problem, and then to match the right test method to the problem. As simple as this seems, this dogma is violated as often as it is fulfilled. For instance, it is not unusual for researchers to use paired comparison tests (head to head preference tests) in order to guide product developers, even though the data do not really help the developers who receive the report. Nor is it particularly unusual for researchers to rely upon in-house expert panels who select products based upon liking or preference, in order to go forward with these products for further development, despite the fact that the in-house expert panelists are not the relevant consumers of the product, and their preferences may be irrelevant to the population at large.

Product testing can address many different types of problems, depending upon the agenda and needs of researchers. Many of those who commission product tests simply want some type of single number, e.g., significance (is my product better than a competitor?) or a scoreboard rating (how high do I score?). Others using product testing want to learn about the product (e.g., what drives liking, are there segments in the population with different preferences, along what particular attributes is the product strong or weak). Typically, these different questions come from different audiences that commission the research.

    Part 2 Product Testing Specifics


Affective Tests: How Much Do I Like This Product?

Traditionally, most product tests have been commissioned to answer questions about liking. These tests may comprise paired comparisons (in which one product is paired against a standard, such as the market leader). In other cases, the researcher will use a scale of liking (e.g., the 9 point hedonic scale, varying from dislike extremely to like extremely; Peryam and Pilgrim 1957). Still other researchers use appropriateness as a measure of acceptance, rather than degree of liking (Schutz 1989). Paired comparisons are often used when the marketing objective is to beat or at least to equal a specific product. [Marketers think in terms of performance in comparison to a competitor or to a gold standard]. Scales of liking are used when the researcher wants to determine the degree of acceptance, or even whether or not the product is liked at all (viz., classification). Table 1 presents various scales for liking that have been used by researchers (Meiselman 1978).

    Table 1

    Verbal descriptors for hedonic scales

Scale Points  Descriptors

    2 Dislike, unfamiliar

    3 Acceptable, dislike, (not tried)

    3 Like a lot, dislike, do not know

    3 Well liked, indifferent, disliked (seldom if ever used)

    5 Like very, like moderately, neutral, dislike moderately, dislike very

    5 Very good, good, moderate, tolerate, dislike (never tried)

    5 Very good, good, moderate, dislike, tolerate

9 Like extremely, like very much, like moderately, like slightly, neither like nor dislike, dislike slightly, dislike moderately, dislike very much, dislike extremely

9 FACT Scale (Schutz 1964): Eat every opportunity; eat very often; frequently eat; eat now and then; eat if available; don't like, eat on occasion; hardly ever eat; eat if no other choice; eat if forced

Liking can be further refined in these tests by having the respondent assess the different components of the product as they appear to the senses (e.g., like appearance, fragrance/aroma, taste/flavor, texture/feel, etc.). Quite often, these attribute liking ratings correlate highly with overall liking, suggesting that respondents may have a hard time differentiating overall liking from attribute liking. [Respondents do differentiate the different senses, however, in terms of scaling the amount of an attribute, rather than liking of an attribute, as we will see below].

It is worth noting that quite often researchers use these affective tests along with action standards. An action standard dictates that the product will move forward in the development or marketing process (e.g., to further development or even to market introduction), if the product meets specific objectives (e.g., achieves a certain minimum acceptance level).

    Paired Preference Tests

In consumer research, the role of the paired preference test appears to be sacrosanct, whether deserved or not. Paired preference simply means putting one product up against another, and instructing the respondent to indicate which one the respondent prefers. The preference measure does not show how much one product is liked (or how much more one product is liked versus another). Rather, the paired preference test is simply a head to head match, with winner take all (at least on a person by person basis). The results of paired preference tests are reported in percents, rather than in degree of liking (as would be reported in a scaling exercise).

Paired preference tests can extend to other attributes, because the researcher can also ask the respondent to indicate which product has more of a specific characteristic. The characteristic need not even be a sensory one (such as depth of color, thickness, graininess of texture, etc.). Rather, the characteristic could even be an image one (e.g., one product is more masculine).

Paired preference tests, popular as they are, provide relatively little information. The conventional (yet unproven) wisdom is that consumers make choices by implicitly comparing products to each other. Thus, the typical paired test pits a new product against either the gold standard that it is to replace, or against the market leader (and only the market leader) against which it is thought to compete.

The positives of paired preference testing are the simplicity of the field execution and the ease of understanding a paired comparison result. The negatives are that the demand for head-to-head comparison may focus attention onto small, irrelevant differences (especially when the respondent is searching for a hook on which to hang the comparison), and that the data cannot be used for much beyond the comparison results. Paired comparison data are not particularly useful for product development because they give no real guidance.

It is worthwhile digressing for a moment here to understand a little of the intellectual history of paired testing, because of the widespread use of the methods, and the limitations (which do not appear to affect the use or misuse of the procedure). Paired testing got its start more than a century ago. The German psychologist, physiologist, and philosopher Gustav Theodor Fechner (Boring 1929) was interested in the measurement of sensory perception. However, according to Fechner, the human being is not able to act as a simple measuring instrument in the way that we understand these instruments to operate. [Today's researchers, especially those in experimental psychology and psychophysics, would vehemently disagree with Fechner, but keep in mind that we're dealing with the start of subjective measurement, not with its well developed world today]. According to Fechner, one way to measure sensory perception was to ask people to discriminate between samples. From their behavior Fechner was able to determine the magnitude of difference needed between two samples to generate a just-noticeable difference. The psychometrician L.L. Thurstone (1927) carried this analysis one step further by developing paired comparison methods, in which respondents judged which sample was stronger, heavier, liked more, etc. From the paired comparison data (the ancestor of our paired preference tests), Thurstone developed a subjective scale of magnitude. Thurstone's scaling methods were soon adopted by researchers to erect scales of sensory magnitude and liking, from which developed the long-standing acceptance of paired methods in applied product testing. Thurstone, publishing in the academic literature, was thus able to influence subsequent generations of applied market researchers, many of whom do not know the intellectual origins of this test.

Sensory Questions: How Much Of A Characteristic Does My Product Possess?

Respondents can act as judges of the amount of a characteristic. Sometimes these characteristics or attributes can be quite simple, e.g., the sweetness of a beverage. Other times the attributes may be more complex, such as tomato flavor, which calls into play a host of sensory attributes. Still other times, the attributes may be simple, but require explanation (e.g., the ginger burn of a ginger candy).

There is an ongoing debate in the scientific literature (especially fragrance, but also food) about the degree to which a respondent can validly rate sensory characteristics. On one side of this dispute are those who believe that the only attribute that a consumer can validly evaluate is liking. These individuals believe that it is improper to have consumers rate sensory attributes. On the other side of the dispute are those (including this author) who believe that a well instructed respondent can validly rate sensory attributes, and indeed such a respondent (not an expert, mind you) can switch focus from liking, to sensory, to sensory directional (see below), for many attributes.

The dispute is simple enough to solve through experimentation. In cases where experts and consumers rate the same products, it has been shown that their ratings correlate with each other (Moskowitz 1995), and with known physical variations of the product (Moskowitz and Krieger 1998). The scientific literature suggests that consumers can validly rate sensory attributes, and that their attribute ratings line up with known physical changes in the stimulus. Literally thousands of scientific articles in all manner of disciplines related to sensory perception suggest that the unpracticed respondent can assign numbers whose magnitude matches the physical magnitude of the stimulus (see Stevens 1975). One could ask for no clearer validation of a respondent's abilities.

Sensory Directionals or "Just Right" Information: What To Do

Quite often in developmental research a key question is what's wrong with this product (if anything), and what does the consumer feel that we should do to correct this problem. When a respondent is asked to evaluate problems, the typical question is known as a sensory directional. The question may be phrased somewhat as follows: "Please describe this product: 1 = far too dry ... 5 = perfect on dryness/wetness ... 9 = far too wet". Note that the respondent is assumed to know both the ideal level of dryness/wetness, and the degree to which the product deviates from that ideal.

Respondents often find this type of question fun to answer, because the question allows them to become experts by telling R&D product developers what to do. Sensory directionals are surprisingly on target for visual attributes (viz., the respondent does know the optimal or ideal level of darkness or stripes), usually on target for texture, but sometimes on target, other times off target for taste/flavor. The directions are generally off target for certain types of emotion-laden attributes such as "real chocolate flavor," perhaps because these attributes are hedonic attributes (liking), disguised as sensory attributes. No chocolate ever has enough real chocolate flavor. Even if the developer were to use a lot of chocolate flavoring, the product would still taste too bitter.

Typically, product developers use rules of thumb by which they translate these just right scales to direction. Thus when a respondent says that a beverage lacks real fruit flavor, the product developer knows that often the respondent means that the product is not sweet enough, or that changing the amount of sugar will change the fruit flavor.

    Where Do Sensory and Other Product Test Attributes Come From?

Since much of product testing involves finding a way to have consumers communicate with product developers, it is important that the questionnaire cover the key sensory attributes of the product. A glance at different questionnaires will reveal a remarkable range of attributes, from the very general (overall appearance, aroma, taste, flavor, texture), down to the very specific (e.g., amount of pepper flavor, even amount of black pepper flavor). Some questionnaires are filled with specifics; other questionnaires appear to be probing the surface, without much depth in the quality of information being generated.

Table 2 shows an example of an attribute list for beer (Clapperton, Dagliesh and Meilgaard 1975).

    Table 2

    A comprehensive attribute list for the sensory characteristics of beer.

    First Tier Term Second Tier Term

    1 Spicy

2 Alcoholic Warming, vinous

3 Solvent-like Plastic-like, can liner, acetone

    4 Estery Isoamyl acetate, Ethyl hexanoate, Ethyl acetate

    5 Fruity Citrus, apple, banana, black currant, melon, pear, raspberry, strawberry

    6 Floral Phenylethanol, geraniol, perfumy, vanilla


    7 Acetaldehyde

    8 Nutty Walnut, coconut, beany, almond

    9 Resinous Woody

    10 Hoppy

    11 Grassy

    12 Straw-like

    13 Grainy Corn grits, mealy

    14 Malty

    15 Worty

    16 Caramel Primings, syrupy, molasses

    17 Burnt Licorice, bread-crust, roast-barley, smoky

    18 Medicinal Carbolic, chlorophenol, iodoform, tarry, bakelite

    19 Diacetyl Buttery

20 Fatty acid Soapy/fatty, caprylic, cheesy, isovaleric, butyric

    21 Oily Vegetable oil, mineral oil

    22 Rancid Rancid oil

    23 Fishy Amine, shellfish

    24 Sulfitic

    25 Sulfidic H2S, mercaptan, garlic, lightstruck, autolyzed, burnt rubber

26 Cooked vegetable Parsnip/celery, dimethyl sulfide, cooked cabbage, cooked sweet-corn, tomato ketchup, cooked onion

    27 Yeasty Meaty

28 Ribes Black currant leaves, catty

    29 Papery

    30 Leathery


    31 Moldy Earthy, musty

    32 Sweet Honey, jammy, over-sweet

    33 Salty

    34 Acidic Acetic, sour

    35 Bitter

    36 Metallic

    37 Astringent

    38 Powdery

    39 Carbonation Flat, gassy

    40 Body Watery, characterless, satiating

The novice researcher asked to create a product questionnaire often feels at a loss. Generally, there is no fixed list of attributes. In many corporations there may be a so-called "laundry list" of attributes that have been used over the years by various researchers involved in the product category. All too often, the laundry list is a legacy list comprising so many attributes as to be daunting and formidable. Yet, many practitioners in product testing are loath to dispense with any attributes lest these attributes somehow later on be discovered to be critical to decision making.

Besides legacy lists, or laundry lists, many practitioners use focus groups to elicit consumer-relevant attributes. The focus groups are set up with consumers who first discuss the product, and then sample the product, all the while rating attributes relevant to the product. For many products that are used at home (e.g., laundry, personal care, food prepared at the stove) the respondents may be given products ahead of the focus group, asked to use the products, and then discuss the attributes in the group session. The number of attributes emerging from this exercise can be enormous, requiring that at the end of the exercise the researcher pare down this large number to a workable few that can be used in the questionnaire.

Besides legacy lists and focus groups, in recent years there have sprung up other, more general lists of attributes. These lists, appearing in a variety of publications (e.g., Civille and Lyon 1995), provide a start for the researcher. Unfortunately, many of these attribute lists cover such a wide range of characteristics that only a few attributes are relevant. The researcher, confronted with the task, then simply uses judgment, typically extracting one or two promising attributes from the list, going back to the legacy lists, and perhaps running a focus group. In the end, however, the list chosen is a personal one, dictated by the specific product, the experience and insight of the researcher, and the momentary needs for information.


Scales: What's The Right Metric To Measure Subjective Responses?

Over the years researchers have fought with each other, often passionately, about the appropriate scales to use. We saw above that there are a variety of scales to measure liking. We are talking here of a more profound difference in scaling: the nature of the scale, and the allowable transformations. The standard nine point hedonic scale (Peryam and Pilgrim 1957) is an example of a category scale. Category scales are assumed to be interval scales: the differences between adjacent category points are assumed to be equal up and down the scale. However, there is no fixed zero point. Interval scales (like Fahrenheit or centigrade) allow the researcher to do many statistical analyses, such as calculating the mean and the standard deviation, performing t-tests, regression, etc. The interval scale has no fixed zero, however, so one cannot calculate ratios of scale values. On a nine-point scale of liking we cannot say that a liking of 8 is twice as much liking as a liking of 4. Other scales, such as ratio scales (with a fixed zero), do allow the researcher to make these ratio claims. Weaker scales, such as ordinal scales (viz., rank order these products in degree of liking), just show an order of merit. One cannot even talk about the differences being equal between ranks 1 and 2 versus ranks 2 and 3.

In the end, most researchers end up with the scale that they find easiest to use. The category scale is by far the most widely used because it is simple and can be anchored at both ends. [Sometimes the researcher anchors every point in the scale as well]. Many academic researchers use ratio scales to study the magnitude of perception, and from time to time ratio scales have been used in the commercial realm of product testing. Many researchers in business often use rank order or ordinal scaling. They have to be careful not to interpret differences in ranks as reflecting the magnitude of differences in subjective magnitude.
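To make the scale distinction concrete, the short sketch below shows the analyses an interval scale legitimately supports: means, standard deviations, and a t-test, but no ratio statements. The ratings and the use of Python with numpy and scipy are this illustration's assumptions, not anything prescribed in the text.

```python
# A minimal sketch of the statistics an interval scale permits, using
# hypothetical 9-point hedonic ratings for two products. Means, standard
# deviations, and t-tests are legitimate; ratio statements ("twice as
# liked") are not, because the scale has no fixed zero.
import numpy as np
from scipy import stats

product_a = np.array([7, 8, 6, 9, 7, 8, 5, 7, 8, 6])  # hypothetical ratings
product_b = np.array([5, 6, 4, 7, 5, 6, 5, 4, 6, 5])

print("Mean A: %.2f (SD %.2f)" % (product_a.mean(), product_a.std(ddof=1)))
print("Mean B: %.2f (SD %.2f)" % (product_b.mean(), product_b.std(ddof=1)))

# Independent-samples t-test: valid on interval data
t, p = stats.ttest_ind(product_a, product_b)
print("t = %.2f, p = %.3f" % (t, p))
```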

    Product to Concept Fit

Beyond simply measuring acceptance or sensory attributes, researchers want to discover whether or not a product and a concept fit together. The concept sets the expectations for the product. When a consumer buys a product there are often some expectations about the appearance, the taste, the texture of the product, along with non-sensory expectations (e.g., aspirational expectations, such as sophisticated, etc.). Products can be tested within the framework set up by these expectations in the product-concept test.

The actual test execution is quite simple. In one variation the respondent reads the concept, and thereby has an opportunity to form expectations. The respondent then evaluates one or several products. For each product the respondent rates the degree to which the product delivers what the concept promises. The scale is also simple: either the product delivers too little, just right, or too much of what the concept promises. The scale sounds easy enough to use, and in actuality it is easy (at least respondents say that they have no problems).

One of the key issues to keep emerging in the evaluation of product/concept fit is whether or not respondents really have an idea of what a product should taste like, smell like, or perform like. If the concept promises a specific flavor (e.g., an Italian flavor), or a specific performance (e.g., a 366 MHz processor in a computer), then respondents can easily ascertain whether or not the product lives up to the concept. For many aspirational concepts, however, such as those developed for perfumes, it is difficult, if not impossible, to truly show that a product and concept agree with each other. The respondent may try to articulate the reasons for the concept/product fit, but the reasons will be different for each person.

Base Size: How Many Respondents Are Enough?

The issue of base size continues to appear in product testing, perhaps because base size more than any other factor influences the cost of a project. The greater the number of respondents in a study (viz., the larger the base size), the more comfortable the researcher should feel about the results, presuming, of course, that the respondents participating are the correct respondents. This has led to recommendations, or at least rules of thumb, dictating 100+ respondents for conventional product tests, but far more (e.g., 300+) for claims substantiation tests (Smithies and Buchanan 1990).

Base size issues are not as simple as they might appear. The statistician would aver that the reason for the large base size is that it reduces the uncertainty of the mean. The standard error of the mean, a measure of the expected variation of the mean were the study to be repeated, drops with the square root of the number of respondents. That is, one feels more comfortable about obtaining the same mean the next time if one uses a large enough base size in the first place. Matters can get out of hand, of course, if this requirement for uncertainty reduction in itself demands a base size so high that the test is unaffordable. [Many novice researchers, in fact, often become so fearful about the uncertainty that they refuse to do studies unless there is an enormous base size, preferring rather to do qualitative research instead].

Another way to look at base size considers the stability of the mean rating with an increasing number of respondents. The first rating is, of course, the most influential rating in determining the average, for it is the only data point. Each additional rating affects the average less and less (viz., by 1/n, where n = number of ratings). The author has shown that the mean stabilizes at about 50 ratings, whether the attribute deals with liking or with sensory judgments, and that this number, 50, appears to hold even when the data comprise ratings from a homogeneous population of individuals with the same preference pattern (Moskowitz 1997). Therefore, it appears that the researcher will not be in particular danger if the base size exceeds 50. A base size of 100 might be better (or at least appear so psychologically), but the mean will not change much from 50 ratings to 100 ratings, assuming the sampling rules are maintained for choosing the respondents.
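A small simulation can illustrate both points. The data below are synthetic, and the normal-distribution assumption is purely illustrative: the running mean typically settles down by roughly 50 ratings, while the standard error keeps shrinking only as 1/sqrt(n).

```python
# A simulation sketch (hypothetical data) of how the running mean of liking
# ratings stabilizes as the base size grows, and how the standard error of
# the mean (SEM) shrinks with the square root of n.
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.normal(loc=6.5, scale=1.8, size=200).clip(1, 9)  # simulated 9-pt ratings

for n in (10, 25, 50, 100, 200):
    sample = ratings[:n]
    sem = sample.std(ddof=1) / np.sqrt(n)
    print("n=%3d  running mean=%.2f  SEM=%.3f" % (n, sample.mean(), sem))
# Beyond n of about 50 the running mean changes little, consistent with
# Moskowitz (1997), while the SEM continues to fall only as 1/sqrt(n).
```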

    How Many Products Can A Respondent Validly Taste Or Smell?

A recurring bone of contention in product tests is the number of samples that a respondent can evaluate without becoming fatigued. Some of the conventional (but not necessarily correct) wisdom avers that the typical respondent can evaluate only two or three samples before fatigue sets in. Scientific research in the senses of taste and smell (where fatigue is most likely to occur) suggests that the human senses are far more robust, that we can evaluate many more samples than is commonly thought, and that perhaps the reason for fatigue is boredom with the task, rather than some sensory loss. If, in fact, the consumer respondent knows that the task will last an hour, and has prepared for that eventuality, then no fatigue sets in, as long as the respondent is motivated, and the samples are spaced out with enough time to rinse the mouth. (See Table 3).

Table 3
Maximum Number Of Products That Should Be Tested (Max), Minimum Waiting Time Between Samples (Wait), Pre-Recruit, Four Hour, Central Location Test

    Product Max Wait Comments

    Carbonated soft drink (regular) 15 7 Little adaptation or residual exists

    Pickles 14 10 Garlic and pepper aftertaste build up

    Juice 12 10 Watch out for sugar overload

    Yogurt 12 10

    Coffee 12 10

    Bread 12 4

    Cheese 10 15 Fatty residue on tongue

Carbonated soft drink (diet) 10 10 Aftertaste can linger

    French Fries 10 10 Fatty residue on tongue

    Cereal (cold) 10 10

    Milk based beverage 10 10

    Soup 10 10 Can become filling

    Sausages 8 15 Fat leads to satiety

    Hamburgers 8 15 Amount ingested has to be watched

    Candy chocolate 8 15

    Croissants 8 10 Fat leads to satiety

    Salsa 8 10 Longer waits necessary for hot salsas

    Cereal (hot) 8 10 Amount ingested has to be watched

    Ice Cream 8 10 Fat leads to satiety

    Mousse 6 15 Combination of fat + sugar yields satiation

    Lasagna 6 10 Amount ingested has to be watched

Perhaps some of the concern about fatigue can be traced to what is observed to occur in a corporate environment, when team members in product work (marketers, R&D, sensory analysts) meet to evaluate many samples and to decide on next steps. The tasting and evaluation is so haphazard that after a few samples most participants remark that they cannot remember the first sample that they tasted. Despite R&D's best attempts to present the samples in a coherent order, the team members end up tasting the samples rapidly, trying to keep all the information in their heads. They would be better off simply rating each sample and then tallying the results at the end of the tasting.

    What Is The Best Venue In Which To Test Products?

Product evaluation can range from the simple testing of one or two snack chips to the complex preparation and evaluation of many frozen entrees. Traditionally, market researchers have been used to fairly simple types of tests, where the preparation is simple, and the respondent task is correspondingly simple (e.g., which of these two carbonated beverages do you prefer; rate each carbonated beverage on these ten attributes, etc.).

Product tests are done in three different types of venues. The simplest product tests are done at home. The respondent takes home one (or more) products, uses the products, evaluates the products on attributes, and records the ratings. The interviewer may call the respondent on the phone and record the ratings; the respondent may return with the completed ratings (and either be finished with the study or receive a new set of products to evaluate), mail in the ratings, or even punch in the ratings on the Internet.

Sometimes the researcher wants to have more control over the samples. Control is necessary when the samples must be tested absolutely "blind" (viz., no hint as to brand, hard to do at home unless the product is repackaged), or when the samples must be tested under tight control, or prepared appropriately. The interview can be set up as a mall intercept, or as a hall/pre-recruit test. In a mall intercept, the interviewer sets up a test station somewhere in the mall (e.g., at a store front, in an office right off the mall, or even at an electronic kiosk). The respondent is intercepted by an interviewer, invited to participate in a short interview lasting 5-30 minutes, and then the study begins. Quite often respondents refuse to participate. This is called a refusal, and refusal rates are increasing as shoppers become tired of the interruption. Nonetheless, the mall intercept method is frequently used.

For more elaborate tests, the researcher may hire a hall, or a test facility, and pay the respondent to participate for an extended period. In this extended test the respondent provides a great deal of information. Of course the respondent may be well compensated to participate, but the quality of the data, and the ability to test several samples with the same respondent, more than compensate for the extra cost. Table 4 compares home use and central location tests.

    Table 4

    Central Location Vs Home Use Tests: Advantages Vs Disadvantages

    Aspect Central Location Home Use Test

    Control Over Ingestion And Evaluation High Low

    Test Site Vs Typical Consumption Unnatural Natural

    Amount Of Product To Be Evaluated Limited Unlimited


    Number Of Different Products Tested Many Few

    Measure Satiation and Wearout No Yes

    Mix Many Concepts and Many Products Yes, Easy No, Hard

    Number of panelists/product for stability 20-50 50-100

    Choreographing A Product Test

Over the past 25 years market researchers specifically, along with sensory analysts, have begun to test many different products in a single evaluation session. Rather than limiting the test to a few products (e.g., 3-4 at most), researchers have recognized the value of obtaining data on a large set of products, with each respondent evaluating many products. This product testing, done in a pre-recruit central location or at home, obtains data from a single individual on far more products than traditionally has been the case. With such extensive data, the researcher has been afforded the opportunity to do far more complex analyses than were ever considered possible.

Sometimes the number of products may be as high as 10-15. [Despite what purists say, it is fairly straightforward to evaluate this many products. One need only ensure adequate motivation and time between samples]. A typical example of choreographed evaluation appears in Table 5. The approach is set up so that the respondent is pre-recruited and paid to participate (increasing or at least maintaining motivation). The session may last several hours, enabling the evaluations to proceed at a leisurely pace. Most importantly, the choreographed test session is set up to acquire data from respondents in an optimal fashion. One can almost envision the evaluation session as a "factory," with most of the time being spent acquiring the data. Very little of the interview time is spent briefing the respondent, making the respondent feel comfortable, delivering the product, and then finishing a session. These activities in a short intercept evaluation may take up 5-10 minutes in a 20-minute interview, or 25%-50% of the time. In contrast, in a 120-240 minute interview these 5-10 minutes are negligible.

    Table 5

    Choreography of an Extended Product Evaluation Session

Step 1: Study designed to test multiple products within an extended one or two day session. Rationale: maximizes the amount of data, and the motivation of the respondent.

Step 2: Respondents recruited to participate for an extended session. Rationale: must be phone recruited; typically not a problem if the study is run in the evening or on a weekend.

Step 3: Respondent shows up, and the chief interviewer orients the respondent as to the purpose of the study. Rationale: usually 25 people show up for the test; the orientation is important because it secures cooperation.

Step 4: Interviewer guides respondent through the first product. Rationale: this is slow, because the interviewer guides the group through all the attributes. Respondents are unsure at the start, but have no problem going through the evaluation with an interviewer guiding the group. The products are randomized across respondents.

Step 5: Respondent completes the first product, waits for the second product. Where appropriate an interviewer checks the data. Rationale: neither the interviewer nor the respondent knows the correct ratings, but this checking maintains interest and motivation.

Step 6: Respondent waits a specified time and proceeds with the second product. Rationale: waiting and order of products are strictly controlled.

Step 7: Respondent finishes the evaluations, proceeds to an extensive classification. Rationale: the data provide information on respondent attitudes, geo-demographics, purchase patterns, etc.

Step 8: Respondent paid and dismissed. Rationale: the respondent maintains ongoing motivation and interest throughout the entire evaluation.

    Part 3 Analyses of Product Test Data

    Basic Data: Product x Attribute Profiles

In the past three decades, researchers have begun to recognize that there is much to be learned by obtaining profiles of multiple products on multiple attributes. These profiles or "report cards" are analyzed in a variety of ways. Fundamentally they require the respondent to rate each product on different characteristics, with all of the products rated on the same characteristics. The scales are similar. These report cards or matrices enable the researcher to glance quickly through the data set to determine the scores of a particular product on different attributes, or, more importantly, the scores of many products on the same attribute. The ability to compare many products on a scorecard basis creates a deeper understanding of the product category, and is far easier to comprehend than a series of paired comparisons among all of the products. Respondents may rate the product on sensory, liking, and image attributes, or even fit to different concepts. Table 6 shows an example of such data.


    Table 6

Example of a product x attribute profile. All attributes are rated on 0-100 scales.

Sensory scales: 0 = none at all, 100 = extreme

Liking scales: 0 = hate, 100 = love

    Product A B C

    Appearance

    Like appearance 61 54 59

    Brown 57 43 62

    Flecks 59 30 73

    Tomato Pieces/Amount 22 42 64

    Tomato Pieces/Size 18 35 40

    Vegetable Pieces/Size 11 15 38

    Aroma/Flavor

Like Aroma 45 44 57

    Like Flavor 61 57 18

    Flavor Strength 61 57 76

    Tomato Flavor 55 59 49

    Meat Flavor 13 25 56

    Mushroom Flavor 16 24 36

    Onion Flavor 25 20 36

    Green Pepper Flavor 11 9 30

    Vegetable Flavor 18 18 36

    Herb Flavor 52 36 70

    Black Pepper Flavor 19 14 36

    Garlic Flavor 30 20 34

    Cheese Flavor 11 32 5

    Salt Taste 25 24 23

    Sweet Taste 37 28 19

    Aftertaste 48 44 70

    Sour Taste 26 36 54

    Oily Flavor 26 23 24

    Texture/Mouthfeel

    Like Texture 58 63 60

    Crisp Vegetable Texture 35 34 40

    Oily Mouthfeel 29 24 25

    Thickness 53 41 47


    Drivers of Liking and The Inverted U Shaped Rule for Products

At a higher level of analysis, these matrices enable the researcher to understand what characteristics drive acceptance (or drive other attributes). In days gone by, the researcher might simply correlate all of the attributes with overall liking to determine which particular attribute drives liking. The correlation statistic, assuming as it does a linear relation between two variables, often failed to show the importance of sensory attributes as drivers of liking. Rather, attributes such as "good quality" and "good taste" correlated with overall liking. This is somewhat of a tautology. One approach to understanding drivers of liking plots overall liking on the ordinate versus attribute liking on the abscissa. The equation that describes this relation is: Overall Liking = A + B (Attribute Liking). The slope, B, indicates the relative importance of the attribute. High values of slope B mean that the attribute is important: unit increases in the liking of the attribute correspond to high increases in overall liking. Conversely, low values of slope B mean that the attribute is unimportant (Moskowitz and Krieger 1995).
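A minimal sketch of this slope-based importance measure follows. The liking numbers and the use of numpy are this illustration's assumptions, not data from the article's studies.

```python
# A sketch of the slope-based importance model described above:
# Overall Liking = A + B * (Attribute Liking). Data are hypothetical.
import numpy as np

attribute_liking = np.array([45, 52, 60, 63, 70, 74, 80])  # e.g., liking of flavor
overall_liking   = np.array([40, 50, 57, 62, 66, 73, 78])

B, A = np.polyfit(attribute_liking, overall_liking, 1)  # slope, then intercept
print("Overall Liking = %.1f + %.2f * Attribute Liking" % (A, B))
# A slope near 1 marks an important driver; a slope near 0, an unimportant one.
```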

More recently, it has been the fashion to plot overall liking versus sensory attributes, and fit a curve to the data, as shown in Figure 1 (Moskowitz 1981). The curve, a quadratic function, shows that liking first increases with sensory attribute level, then peaks, and then drops down. The specific form of this inverted U shaped curve will depend upon the specific product being tested, the sensory range achieved, and the preferences of the respondents. Sometimes the curve will be flat, showing that although respondents differentiate the products on a sensory attribute, the sensory differences do not drive liking (at least on a uni-variate basis). Sometimes the curve will show an inverted U shape; other times the curve will increase and perhaps flatten out slightly, as if it were reaching an asymptote. In other situations the curve will show a downwards sloping relation, suggesting that as the sensory attribute increases, liking actually decreases. There is usually quite a pronounced relation between flavor/aroma sensory levels and liking, but less of a relation between texture or appearance and liking. We know from basic research on flavor hedonic levels that the chemical senses (taste, smell) provoke stronger hedonic reactions (like Vs dislike) than do the visual, auditory, and tactile senses. Thus, for consumer goods (whether food or health and beauty aids), we should not be surprised to see these strong sensory-liking curves for the flavor attributes, but not for many of the appearance or texture attributes.
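The quadratic fit itself is straightforward. The sketch below, again on hypothetical data, fits liking against a sensory attribute and its square, then solves for the sensory level at which liking peaks (the vertex of the parabola).

```python
# A sketch of fitting the inverted U: liking regressed on a sensory
# attribute via a quadratic, then solving for the peak. Data are hypothetical.
import numpy as np

flavor_intensity = np.array([20, 30, 40, 50, 60, 70, 80])  # sensory ratings
liking           = np.array([42, 55, 64, 68, 66, 58, 47])

c2, c1, c0 = np.polyfit(flavor_intensity, liking, 2)  # quadratic coefficients
peak = -c1 / (2 * c2)  # vertex of the parabola (meaningful when c2 < 0)
print("Liking = %.1f + %.2f*x + %.4f*x^2; liking peaks near x = %.0f"
      % (c0, c1, c2, peak))
```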

    Figure 1

How flavor intensity drives overall liking. The researcher creates a scattergram, and fits a quadratic function to the data. The quadratic function shows where liking tends to peak, and reveals the degree to which liking co-varies with a single sensory attribute.


    Sensory Preference Segmentation

Marketers are the first to recognize that consumers differ from each other. It should come as no surprise that product test results are often reported by total panel, and by key subgroups. Traditionally, these key subgroups comprise individuals falling into different geo-demographic classifications, or into different brand usage patterns (e.g., users of brand X Vs brand Y; heavy Vs light users of brand Z; users versus non-users). Some of these segmentations also may be based upon responses to a battery of questions, which divide people into groups based upon pre-set classifications (Mitchell 1983). For the most part these segmentations, whether by brand, by geo-demographics, or by other, overarching psychological schemata, do not show particularly striking differences in the consumer responses to products. That is, in a product test these different subgroups of individuals appear to show quite similar reactions to the products themselves.

Psychologists interested in the basic science of perception have introduced newer concepts of segmentation. The fundamental idea behind sensory-based segmentation begins with the previously suggested relation, viz., as a sensory attribute increases, liking first increases, peaks (at some optimal sensory level), and then drops down with further increases in the sensory attribute (Moskowitz, Jacobs and Lazar 1985). It is important to keep in mind that the segmentation is based upon the sensory level at which a person's liking rating peaks, and not upon the magnitude of liking. [A short digression is appropriate here. Segmentation based upon the pattern of liking ratings, rather than relations between liking and sensory attributes, could introduce artifacts. Respondents showing similar sensory-liking patterns may fall into different segments if one respondent assigns high numbers to products and the other respondent assigns low numbers to the same products]. When applied to product test results (more than 6 products must be evaluated by each respondent), this segmentation, based upon the sensory-liking patterns, reveals that there are often two, three, or more groups in the population with distinct sensory preference patterns. That is, each segment shows a different sensory level at which liking peaks. There may be individuals who like strong flavored products, individuals who like weak flavored, lighter colored products, and the like. (See Figure 2, which shows data for coffee).

Sensory preference segments transcend geo-demographics, brand usage, and values. Consumers in different segments may use the same brand, have the same values, etc. This organizing principle for product development, emerging from product test results (along with basic psychophysics), is forming the foundation of new approaches to consumer research and applied product development.
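One plausible implementation of such a segmentation is sketched below, under stated assumptions: simulated respondents, a quadratic liking curve fit per respondent, and k-means with two clusters as a stand-in for whatever clustering method a practitioner prefers. The key point from the text is preserved: respondents are clustered on the sensory level where their own liking peaks, never on the magnitude of their liking.

```python
# A sketch of sensory preference segmentation on simulated data: fit each
# respondent's liking-vs-sensory curve, extract the sensory level where
# liking peaks, and cluster respondents on those peak locations.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
sensory = np.array([20, 30, 40, 50, 60, 70, 80])  # attribute levels of 7 products

# Simulate 40 respondents: half prefer low intensity, half prefer high
true_peaks = np.concatenate([rng.normal(35, 5, 20), rng.normal(65, 5, 20)])
peaks = []
for p in true_peaks:
    liking = 70 - 0.02 * (sensory - p) ** 2 + rng.normal(0, 3, sensory.size)
    c2, c1, c0 = np.polyfit(sensory, liking, 2)  # per-respondent inverted U
    peaks.append(-c1 / (2 * c2))                 # sensory level where liking peaks

segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    np.array(peaks).reshape(-1, 1))
print("Segment optima:", np.round(segments.cluster_centers_.ravel(), 1))
```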

    Figure 2

Sensory segmentation results from a five-country study of coffee. The left panel shows how overall liking for respondents from each of five countries (1-5) is driven by perceived bitterness. The right panel shows the same liking-bitterness relation, this time using data from sensory segments obtained from the five countries. Note that each segment is represented in each country, albeit in different proportions.

    Product Mapping and Creating A Product Model From the Map

At an even higher level of analysis, the matrix of product x attribute ratings is used to map products into a geometrical space (see Figure 3). The geometrical space may comprise one, two, three, or many more dimensions (although it is difficult to visualize a space of more than three dimensions). Product mapping is done typically with disconnected products, usually those current in the product category, but occasionally with prototypes (so-called category appraisal, aptly named because the approach considers the full range of products currently in a category, without regard to the connectedness of their underlying formulation; Munoz, Chambers and Hummer 1996).

The locations of the products in the map are usually based upon the sensory properties of the products, with products located close together being sensorially similar to each other. Brand managers and product developers use these brand maps as heuristics to understand how the consumer perceives the products as a set of interchangeable (or non-interchangeable) items.

When the coordinates of the brand map are factor scores (from factor analysis), the map serves double duty. First, the map locates the products in a space, as the heuristic to visualize. Second, the coordinates of the map (viz., the dimensions) act as independent variables in a regression model. [The map coordinates, being factor scores, are parsimonious and statistically independent, so that no violence is done to the data]. As a consequence the researcher builds a set of equations (one per attribute, such as liking, image rating, sensory rating), with the independent variables being the factor scores. [The models are not just linear equations, but also quadratic equations]. At the end of the day the researcher is able to identify a location in the map, and estimate the full sensory profile and liking rating corresponding to that location. This type of analysis can only occur if the researcher tests multiple products (at least six, but preferably 10 or more). Such an approach using mapping and modeling (with factor scores as independent variables, and all rating attributes as dependent variables) typically would occur early in the research process, when the objective is to understand the category, discover drivers of liking, and then identify unfilled holes in the product category.
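A sketch of this map-and-model idea follows, with hypothetical data, and with PCA used as a stand-in for factor analysis (the article specifies factor scores; PCA is substituted here only to keep the example self-contained).

```python
# A sketch of mapping products by reduced sensory dimensions and then
# regressing liking on the map coordinates (linear + quadratic terms).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
sensory = rng.uniform(0, 100, size=(10, 8))   # 10 products x 8 sensory attributes
# Hypothetical liking: highest when the first attribute is mid-range
liking = 75 - 0.01 * (sensory[:, 0] - 50) ** 2 + rng.normal(0, 2, 10)

scores = PCA(n_components=2).fit_transform(sensory)  # the map coordinates
X = np.column_stack([scores, scores ** 2])           # linear + quadratic terms
model = LinearRegression().fit(X, liking)

# Any location (f1, f2) in the map now yields an estimated liking rating
f1, f2 = scores[:, 0].mean(), scores[:, 1].mean()
print("Estimated liking at the map center: %.1f"
      % model.predict([[f1, f2, f1 ** 2, f2 ** 2]])[0])
```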

Figure 3

Example of mapping a product category (ointment), with the outcome being at least two dimensions (or factors), based upon a principal components analysis of the sensory attributes. Products are circles, with the size proportional to liking. Holes in the category are locations without any products.

    Product Optimization

Originally, optimization was done using the relation between liking and sensory attributes (Schutz 1983). This is called R-R or response-response analysis, because the optimization model uses relations between two dependent or response variables (viz., liking is a rated attribute, as are sensory attributes). More recently, however, attention has focused on the variables under the developer's control (rather than on the sensory-liking relation). This is called S-R or stimulus-response analysis, because the optimization model uses the relation between variables under the developer's control and consumer rated attributes.

Several decades ago, statisticians developed systematic arrays of variables with the property that the variables were independent of each other. These are called experimental designs (Box, Hunter and Hunter 1978; Khuri and Cornell 1987; Plackett and Burman 1946). Product developers use these designs to create arrays of products which comprise known variations (e.g., in ingredients or in process conditions). By varying the independent variables (viz., features under the developer's control), and by obtaining reactions (e.g., consumer liking, cost of goods, yield, sensory attributes, and even image attributes), the developer can link the variables under direct operational control to what the consumer reports. Furthermore, these experimental designs lend themselves to statistical modeling by regression. It is a short step from the experimental design plus consumer reactions to modeling, and another short step from modeling (or equation building) to optimizing (viz., identifying the combination of the variables under design that produces the highest liking, or the highest liking constrained to a given cost, etc.). Table 7 shows two experimental designs: one a screening design (allowing the researcher to investigate many different variables, each at two levels), and the other a more complex experimental design (called a Box-Behnken design). The Box-Behnken or response surface design enables the researcher to create a quadratic model relating formulation to liking, cost, and other attributes, allowing the researcher to optimize liking or any other variable, subject to explicit constraints (viz., the formulation must stay within the range tested) and subject to implicit constraints (viz., one or more attributes of the product, such as a sensory attribute or even cost of goods, must remain within specified limits). Table 8 shows results for the optimization of a commercial pizza which varied on six formula variables.
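A minimal sketch of the S-R modeling-and-optimization step appears below, using the 15-run, 3-variable design from the right-hand side of Table 7. The liking scores are hypothetical, and the quadratic model plus bounded numerical search is one plausible implementation rather than the procedure the author used.

```python
# A sketch of response-surface optimization: fit a quadratic model of liking
# to a designed array of formulations, then find the in-range maximum.
import numpy as np
from scipy.optimize import minimize

# The 15-run, 3-variable design from Table 7 (coded levels 1-3)
design = np.array([[3,3,3],[3,3,1],[3,1,3],[3,1,1],[1,3,3],[1,3,1],[1,1,3],
                   [1,1,1],[3,2,2],[1,2,2],[2,3,2],[2,1,2],[2,2,3],[2,2,1],[2,2,2]])
liking = np.array([55,60,58,52,61,66,63,57,59,64,62,60,65,61,68])  # hypothetical

def expand(x):
    x = np.atleast_2d(x)
    return np.column_stack([x, x ** 2])  # linear + quadratic terms

A = np.column_stack([np.ones(len(design)), expand(design)])
coef, *_ = np.linalg.lstsq(A, liking, rcond=None)  # fit quadratic model

def predicted_liking(x):
    return coef[0] + expand(x)[0] @ coef[1:]

# Maximize liking, with the explicit constraint: stay within the range tested
res = minimize(lambda x: -predicted_liking(x), x0=[2, 2, 2],
               bounds=[(1, 3)] * 3)
print("Optimum levels:", np.round(res.x, 2),
      "predicted liking: %.1f" % predicted_liking(res.x))
```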

    Table 7

Example of two experimental design systems commonly used in product development. The left system (11 variables, A-K, 12 runs) is a Plackett-Burman screening design. The design is used to identify key drivers of a response. Each variable (A-K) appears at two options (1 = new, high, on; 0 = old, low, off, respectively). The right system (3 variables, A-C, in 15 runs) is a central composite design. It is used to identify optimal levels of a process or an ingredient.

    PROD A B C D E F G H I J K PROD A B C

    1 1 1 0 1 1 1 0 0 0 1 0 1 3 3 3

    2 1 0 1 1 1 0 0 0 1 0 1 2 3 3 1

    3 0 1 1 1 0 0 0 1 0 1 1 3 3 1 3

    4 1 1 1 0 0 0 1 0 1 1 0 4 3 1 1

5 1 1 0 0 0 1 0 1 1 0 1 5 1 3 3

6 1 0 0 0 1 0 1 1 0 1 1 6 1 3 1

    7 0 0 0 1 0 1 1 0 1 1 1 7 1 1 3

    8 0 0 1 0 1 1 0 1 1 1 0 8 1 1 1

    9 0 1 0 1 1 0 1 1 1 0 0 9 3 2 2

    10 1 0 1 1 0 1 1 1 0 0 0 10 1 2 2

    11 0 1 1 0 1 1 1 0 0 0 1 11 2 3 2

    12 0 0 0 0 0 0 0 0 0 0 0 12 2 1 2

    13 2 2 3

    14 2 2 1

    15 2 2 2

    Table 8

Example of an optimization run (for pizza). Six ingredients were varied systematically according to a central composite design. The results led to a product model, interrelating formulations (under the developer's control) and ratings (acquired by the researcher). Data from optimization studies let the researcher quickly identify the most promising formulations.

[Body of Table 8 not recoverable from the source; its columns optimized total liking under different constraints (none, cost, etc.).]


the optimum lies further out. The process repeats, but usually the procedure generates an optimum fairly quickly (viz., in one or two iterations).

Product optimization is occasionally run by market researchers, occasionally by product developers using small-scale R&D efforts (called Research Guidance Panels). Optimization procedures, long in coming to consumer research, provide one of the most promising technologies for improving the usefulness of product tests, because the data generate both a database (for future use) and immediate, concrete action (to solve the current problem in an actionable fashion).

Emerging Product Test Methods: Observation

Up to now, we have dealt exclusively with paper and pencil (or computer based) questionnaires. Recently, the ethnographic approach to product testing has begun to gain favor, especially in R&D facilities of larger companies. Ethnography looks at behavior in the normal environment. Similarly, ethnographic product testing looks at how respondents use specific products. Right now the goal of ethnographic product research is to identify the ways that products are used, in order to improve them, or to come up with new ideas. Ethnographic research is not typically (but is sometimes) used to evaluate different products. We can trace ethnographic research back to a combination of two sciences: anthropology (which looks at the behavior of people in their environment), and human factors (a branch of psychology that looks at the behavioral interaction of a person and a machine, or a person and a stimulus). Ethnographic research is only in its infancy for product evaluation, but it is a trend that bears watching.

Emerging Product Test Methods: Satiety Evaluation

Home use evaluation is supposed to capture the real nuances of product use. One area of evaluation that is coming into play, especially for fast moving consumer goods of different sensory characteristics, is known as satiety evaluation. The goal here is to identify the factors that make consumers stop consuming a product. We know that people stop eating foods when they are satiated. We know that there is variety seeking among consumers (van Trijp 1995), so that eventually consumers habituate, or get bored with the same soap fragrance, or the same cereal, and buy another fragrance, another cereal, or the same product but in a different flavor. Consumer researchers are now turning their attention to identifying the factors underlying this boredom, defined either as short term satiation (just can't eat or drink any more) or long term habituation (just got tired of the taste, and wanted to try something new). Like ethnographic methods, satiety and boredom aspects of product testing are just in their infancy. The jury is out as to what will be found, but these new methods are part of the evolving growth of product research.

    The Estimation of Volumetrics

Many marketers use product-concept tests to estimate the amount of a product that they will sell. Through years of experience, researchers have developed databases and then models showing the relation between concept scores (before product usage), product scores (during usage), and estimated volume. These approaches, expanded by Bases (1999), use models that interrelate many of the marketing variables. Concept scores and product scores are only two of the variables. Other factors considered are awareness (or some surrogate of advertising), and distribution.

The most widely used volumetric modeling is Bases II (AC Nielsen Bases, 1999). This system assesses the appeal and market potential for products, based upon the reaction of consumers first to the concept (simulating the marketplace reality of advertising), followed by the reaction to the product (again simulating marketplace reality). Bases II estimates Year 1 sales volume after launch, as well as estimating volume for Years 2 and 3, respectively. In contrast to the methods described above, the market potential analysis calls into play many factors, not just the product itself.

These market models differ from the developmental models previously discussed because the market models come into play at the end of the development process. There is no search for patterns, because these studies concentrate on one product, one concept, and the response from many individuals. Quite often, however, the widespread popularity of these test market simulators has led to their use as de facto product-concept tests. Unlike conventional product-concept tests that look only at the fit of the product and concept, volumetric testing incorporates many more aspects of the marketing mix.

Part 3: Package Design Research

    A Short Overview

Package research is often the neglected child in the consumer testing of price, positioning, product, and package. Until recently, little attention had been given to the product package. Companies might launch new products, and involve their in-house package design professionals as well as outside design houses; however, much of the research done was either qualitative (focus groups), or no research was done at all. Part of this lack of research stems from the perception that the package is not particularly important: it is just a vehicle in which to enclose the product (especially in fast moving consumer goods). Part of the lack of research can be traced to the design houses, which perceive their work as artistic, and not amenable to judgment. [No design house wants consumers to design the package for them]. As a consequence, there is a paucity of research literature on package design, although there are scattered articles here and there, and a number of well respected companies specializing in package research. [We can compare this dearth of literature to the extremely high volume of literature on advertising testing, perhaps because many more dollars are spent on advertising, so it is immediately more important to be right].

Today, however, with increased competition in many categories, the package is assuming renewed importance, and researchers are rising to the occasion. Manufacturers are sponsoring short in-house courses on the importance of package design. Package design firms, once the bastion of the artist, are welcoming quantitative research. [It should be acknowledged that these design firms always welcomed qualitative research, because that research, like qualitative research in advertising, probed and revealed aspects of the package that were important for the designer]. Package design firms are also adopting a more holistic approach to the design process. According to researchers at Cheskin (a brand identity and design firm), commenting on the design of a beer package (1999): "Beyond helping you manage clients, research can actually help you create better design, not by dictating color and form, but by informing your intuition...When your job is to make a product sell itself at the point of sale, how do you know that your design will deliver the goods?...What does the package say about the beer inside?...How should your client position the product?...If you just look at the packages, what would you say about the beer?"

    What Should Package Testing Provide?

Package testing serves a number of purposes. The most important is to confirm or disconfirm the objectives of the package designer. Typically package designers create packages (graphics and/or structures) with some objective in mind, such as reinforcing the brand, communicating new benefits, or enhancing the chances that the product will be selected. For all of these objectives, package testing must provide some idea as to whether or not the new package is successful. A modicum of sensitivity to the creative process is also in order, since package testing often reflects on the creative abilities of the designer, who is both an artist and a business-responsive individual.

    Focus Groups

For many years, the conventional package evaluation consisted of focus groups. Focus groups, properly conducted, can be very valuable in the creative process, especially in the up-front developmental stage. The focus group is private, does not come out with hard and fast results, and can cover a great deal of territory in a warm, informal manner. In a package design focus group the respondent can verbalize reactions to package features, can identify key features that elicit interest, and can talk about the coherence between the package itself and the brand. A sensitive package designer gets a great deal out of focus groups, because the designer can see and hear how the consumer reacts to the packages. By presenting different possible packages, the designer can see which ones work and which do not. Focus groups provide the designer with a great deal of feedback, generally in a non-threatening, non-judgmental manner (viz., the designer is not judged by another professional, although consumers could reject the package). It should come as no surprise that this type of qualitative research has been welcomed by designers, because in essence it reflects how they would intuitively go about obtaining feedback about their creations.

Focus groups can, however, backfire in several ways. First, they can be misused: in a focus group people say many different things, and the listener can select specific phrases to suit his/her own purpose and agenda. Second, and even more importantly, respondents often like to play designer. Respondents like to tell designers what to do, even if the designers aren't really interested in following the sage advice (Glass 1999). Consequently, an antagonism may develop between the client (who wants to follow the respondents' suggestions) and the designer (who has a basic artistic and business idea in mind and simply wants feedback).

    Profiling Packages

At the simplest level the researcher can determine whether or not the package is acceptable to the consumer, or fits the brand. This type of testing typically involves attitudinal measures. The researcher shows the package to the respondent and obtains a profile of ratings, similar to the way that the researcher obtains product ratings. The key difference is that the researcher may obtain profiles of the expectation of the product (based upon the package), as well as ratings of the package itself. The attributes can vary substantially from one product category to another. Some of the ratings may deal with interest in the package (or in the product, based upon exposure to the package). Other ratings may deal with one's expectations of the product, based upon the package. If the respondent actually uses the product in the package, then the researcher can obtain ratings of person-package interaction (ease of carrying or gripping the product, ease of opening, ease of removing the product, ease of storing the package, etc.). This type of information is extremely valuable to the package designer, who wants to find out whether or not the package is on target.

Typically the designer receives only the simplest of "briefs" or project descriptions, such as the requirement that the package be upscale, that it live up to the brand, and that it communicate an effective or good tasting product. The data from evaluative package tests provide the diagnostics to demonstrate whether or not the package, as designed, actually lives up to the requirements set by the package design group at the client.

Behavioral Measures: T-Scope and Speed of Recognition

At a more behavioral level, package testing may involve behavioral measures. One behavioral measure is the ability to identify whether or not a package is actually on the shelf during a very short exposure time. The rationale behind this type of testing (called T-Scope, or tachistoscope, testing) is that the typical shopper spends relatively little time inspecting a store shelf. In order for a package to make its impact, it is important that the package visually "jump off the shelf." The researcher assumes that those packages that are perceived in the short interval permitted by the T-Scope have shelf presence. In some demonstrations, Elliot Young (1999) has shown that in these T-Scope tasks the exposure time is often so short that the respondent has to use color and design cues rather than brand names. Well-known brands with distinctive package features (e.g., Tide brand detergent, with its concentric halos) are rapidly recognized, even if the brand name shown is incorrect. If the research interest focuses on the recognizability of a single package, then one can present single packages and determine the fastest shutter speed (viz., the least time needed) for the package to be correctly identified. If the research interest focuses on the "findability" of the package on the shelf, then the researcher can place the test package at different locations on the shelf and parametrically assess the contributions of package design and shelf location as joint contributors to findability.
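As a small illustration of how such threshold data might be summarized, the sketch below (in Python) finds the shortest exposure meeting a criterion accuracy. The durations, accuracies, and the 75% criterion are invented for illustration, not a standard from the T-Scope literature.

    # Sketch of one way to summarize T-Scope data: find the shortest exposure
    # duration at which a package is identified at or above a criterion
    # accuracy. All numbers below are illustrative assumptions.

    def recognition_threshold(results, criterion=0.75):
        """results: dict mapping exposure duration (ms) -> proportion correct.
        Returns the shortest duration meeting the criterion, or None."""
        passing = [ms for ms, p_correct in results.items() if p_correct >= criterion]
        return min(passing) if passing else None

    # Hypothetical data for one package across a set of shutter speeds.
    data = {50: 0.20, 100: 0.55, 200: 0.80, 400: 0.95}
    print(recognition_threshold(data))  # -> 200 (ms)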

Eye Tracking and the Features of a Package

Eye tracking is another method for testing the shelf. The typical shelf is an extremely complex array of package designs over which the consumer's eye wanders. The objective of package designers is to guide the consumer to the package; this constitutes the first step in selecting the product. Eye tracking allows the researcher to identify the pattern that the eye follows. Is the eye drawn to the client's particular package? Does the eye wander away from the package (thus diminishing the chances that the customer will select the product)? Young (1999) has also demonstrated the use of eye tracking technology, based upon military research developments of the 1970s. The eye tracking technology traces the location of the line of sight. It shows how the consumer, when presented with either a shelf set or a single package, explores the stimulus; it tracks how the eye wanders and records what the eye sees. When done for a single stimulus (e.g., an over-the-counter medicine), the eye tracking technology can show whether, when, and for how long key messages are looked at (but not whether those messages are good). When done for the entire shelf, the eye tracking technology can identify whether or not the package is even looked at, and for how long. Based upon these analyses, Young and his associates have amassed a variety of generalizations about where on the shelf the package should be, how many facings the package should have, etc.
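The basic bookkeeping behind a "was it looked at, and for how long" measure can be sketched simply: count gaze samples falling inside the package's region of the shelf image. The sampling rate, coordinates, and region below are invented assumptions about how gaze data might arrive from a tracker.

    # Sketch of a basic eye tracking summary: total dwell time of gaze
    # samples inside the package's region of the shelf image (an "area of
    # interest"). The 60 Hz rate, coordinates, and box are illustrative.

    def dwell_time_ms(gaze_points, aoi, sample_interval_ms=1000 / 60):
        """gaze_points: list of (x, y) screen coordinates, one per sample.
        aoi: (left, top, right, bottom) bounding box of the package."""
        left, top, right, bottom = aoi
        hits = sum(1 for x, y in gaze_points
                   if left <= x <= right and top <= y <= bottom)
        return hits * sample_interval_ms

    # Hypothetical gaze samples and a package occupying a 200x300 pixel region.
    gaze = [(120, 340), (150, 360), (480, 90), (160, 355), (170, 348)]
    package_aoi = (100, 300, 300, 600)
    print(round(dwell_time_ms(gaze, package_aoi)))  # 4 samples inside -> ~67 ms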

Optimizing Package Design: Can Art and Science Mix?

One of the most intriguing developments is the advance of conjoint measurement (systematic stimulus variation) into package design. Conjoint measurement comprises the experimental variation of the components of a concept in order to understand the underlying dynamics of how the components perform. The respondent evaluates full combinations of these components (e.g., benefits, prices, etc.), and from the ratings researchers estimate the part-worth contribution of each component. The same research paradigm (viz., systematic variation of components) has begun to enter package research, but this time the components are features of the package (e.g., different names, colors, graphics, etc.), and the "concepts" are full packages comprising these systematically varied features. From the responses of consumers to the test packages (e.g., created on the computer screen), the researcher and package designer quickly discover what each design feature contributes to consumer interest, communication, etc.
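A minimal sketch of the part-worth estimation step appears below, using dummy-coded ordinary least squares (one common analysis for such designs). The features, levels, and ratings are invented for illustration.

    # Minimal sketch of part-worth estimation for package features via
    # dummy-coded linear regression. Features, levels, and ratings invented.
    import numpy as np

    # Each test package combines one logo and one background color.
    # Columns: [logo_B, color_red, color_blue]; the omitted level of each
    # feature (logo_A, color_white) serves as the baseline.
    X = np.array([
        [0, 0, 0],   # logo A, white background
        [0, 1, 0],   # logo A, red background
        [1, 0, 1],   # logo B, blue background
        [1, 1, 0],   # logo B, red background
        [0, 0, 1],   # logo A, blue background
        [1, 0, 0],   # logo B, white background
    ])
    ratings = np.array([52, 61, 64, 70, 48, 58])  # mean consumer interest, 0-100

    # Add an intercept column and solve by ordinary least squares.
    design = np.column_stack([np.ones(len(X)), X])
    partworths, *_ = np.linalg.lstsq(design, ratings, rcond=None)

    for name, w in zip(["baseline", "logo_B", "color_red", "color_blue"], partworths):
        print(f"{name}: {w:+.1f}")
    # A positive part-worth lifts interest relative to the baseline levels.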

Some of the power of the conjoint approach applied to graphics design can be seen by inspecting Figure 4 (showing the template, the components, one finished package, and a table of utilities). The conjoint system enables the designer to rapidly understand the effectiveness of each design element, and then to create new and better combinations by incorporating high performing graphic elements and discarding poor performing elements. Of course, attention must always be paid to the artistic coherence of the design. Unlike concept testing using conjoint measurement, however, there is far less in the literature (and in actual practice) on systematically varied package features. The best explanation is that package design is still an artistic endeavor, resisting research on the one hand, yet welcoming research insights on the other.

Figure 4
Components of package design research. A = categories or sets of features; B = specific template used to embed graphic features; C = features in specific designed package; D = juice package corresponding to the specific set of designed features.


    Package Design An Overview

In contrast to product (and concept) development, package research is still in its infancy. For the most part, researchers are only beginning to recognize the importance of the "package" as a key factor. Up to now, the artistic aspects of package design, along with the perception of the package as a secondary factor, have hindered the full flowering of package research. That unhappy situation is beginning to remedy itself as the competitive environment forces recognition of package research as a key new source of information, useful for maintaining or achieving competitive advantage in the marketplace.

    References

AC Nielsen Bases (1999), Sales Brochure (from the internet: www.bases.com).

Boring, E.G. (1929), Sensation and Perception in the History of Experimental Psychology. New York: Appleton-Century-Crofts.

Box, G.E.P., W.G. Hunter, and J.S. Hunter (1978), Statistics for Experimenters. New York: John Wiley.

Cheskin, Inc. (1999), "Five Bottles of Beer," Critique: The Magazine of Graphic Design Thinking, (Spring), 89-95.

Civille, G.V., and B.G. Lyon, eds. (1995), FlavLex, Version 1.15 (software; copyright Softex, Lancaster). West Conshohocken: American Society for Testing and Materials.

Clapperton, J., C.E. Dalgliesh, and M.C. Meilgaard (1975), "Progress Towards an International System of Beer Flavor Terminology," Master Brewers Association of America Technical Quarterly, 12, 273-280.

Gacula, M.C., Jr. (1993), Design and Analysis of Sensory Optimization. Trumbull: Food and Nutrition Press.

Glass, A. (1999), personal communication.

Khuri, A.I., and J.A. Cornell (1987), Response Surfaces. New York: Marcel Dekker.

Meilgaard, M., G.V. Civille, and B.T. Carr (1987), Sensory Evaluation Techniques. Boca Raton: CRC Press.

Meiselman, H.L. (1978), "Scales for Measuring Food Preference," in Encyclopedia of Food Science, M.S. Petersen and A.H. Johnson, eds. Westport: AVI, 675-678.

Mitchell, A. (1983), The Nine American Lifestyles. New York: Macmillan.

Moskowitz, H.R. (1981), "Relative Importance of Perceptual Factors to Consumer Acceptance: Linear versus Quadratic Analysis," Journal of Food Science, 46, 244-248.

Moskowitz, H.R. (1983), Product Testing and Sensory Evaluation of Food: Marketing and R&D Approaches. Westport: Food and Nutrition Press.

Moskowitz, H.R. (1984), Cosmetic Product Testing: A Modern Psychophysical Approach. New York: Marcel Dekker.

Moskowitz, H.R. (1985), New Directions in Product Testing and Sensory Analysis of Food. Westport: Food and Nutrition Press.

Moskowitz, H.R. (1994), Food Concepts and Products: Just in Time Development. Trumbull: Food and Nutrition Press.

Moskowitz, H.R. (1995), "Experts versus Consumers," Journal of Sensory Studies, 11, 19-35.

Moskowitz, H.R. (1997), "Base Size in Product Testing: A Psychophysical Viewpoint and Analysis," Food Quality and Preference, 8, 247-256.


    HOWARD R. MOSKOWITZ

Howard Moskowitz is president and CEO of Moskowitz Jacobs Inc., a firm he founded in 1981. Dr. Moskowitz is simultaneously a renowned experimental psychologist in the field of psychophysics (the study of perception and its relation to physical stimuli) and an inventor of world class market research technology. Among his important contributions to market research is his 1975 introduction of psychophysical scaling and product optimization for consumer product development. Whereas these methods are standard and well accepted today, in 1975 they required a massive culture change in the business community. In the 1980s his contributions in sensory analysis were extended to health and beauty aids. Dr. Moskowitz has also developed and refined procedures that enable researchers to interrelate products, concepts, consumers, experts, and physical test instruments, in order to accomplish product optimization and reverse engineering. Finally, his research and technology developments have led to concept and package optimization (IdeaMap®), integrated and accelerated development (DesignLab®), and the globalization and democratization of concept development for small and large companies alike, in an affordable, transaction-oriented approach (IdeaMap® Wizard).

Dr. Moskowitz graduated from Harvard University in 1969 with a Ph.D. in experimental psychology. Prior to that he graduated from Queens College (New York), Phi Beta Kappa, with degrees in mathematics and psychology. He has written or edited eleven books, published well over 180 articles, lectures in the U.S. and abroad, and serves on the editorial boards of major journals. He has won numerous awards, among them the Scientific Director's Gold Medal for outstanding research at the U.S. Army Natick Laboratories. In 1992 Dr. Moskowitz founded a $2,000 prize for young scientists working in the psychophysics of taste and smell, administered through the Association for Chemoreception Sciences.


[Fragment of a comparison table; the column headings were lost in extraction]
Measure Satiation and Wearout: No / Yes
Mix Many Concepts and Many Products: Yes, Easy / No, Hard
Number of panelists/product for stability: 20-50 / 50-100

    Choreographing A Product Test

Over the past 25 years, market researchers, along with sensory analysts, have begun to test many different products in a single evaluation session. Rather than limiting the test to a few products (e.g., 3-4 at most), researchers have recognized the value of obtaining data on a large set of products, with each respondent evaluating many products. This product testing, done with pre-recruited respondents at a central location or at home, obtains data from a single individual on far more products than has traditionally been the case. With such extensive data, the researcher is afforded the opportunity to do far more complex analyses than were ever considered possible.

Sometimes the number of products may be as high as 10-15. [Despite what purists say, it is fairly straightforward to evaluate this many products. One need only ensure adequate motivation and time between samples]. A typical example of a choreographed evaluation appears in Table 5. The approach is set up so that the respondent is pre-recruited and paid to participate (increasing, or at least maintaining, motivation). The session may last several hours, enabling the evaluations to proceed at a leisurely pace. Most importantly, the choreographed test session is set up to acquire data from respondents in an optimal fashion. One can almost envision the evaluation session as a "factory," with most of the time being spent acquiring the data. Very little of the interview time is spent briefing the respondent, making the respondent feel comfortable, delivering the product, and then finishing the session. In a short intercept evaluation these activities may take up 5-10 minutes of a 20-minute interview, or 25%-50% of the time. In contrast, in a 120-240 minute interview these 5-10 minutes are negligible.

    Table 5

    Choreography of an Extended Product Evaluation Session

Step 1
Activity: Study designed to test multiple products within an extended one or two day session.
Rationale: Maximizes the amount of data, and the motivation of the respondent.

Step 2
Activity: Respondents recruited to participate for an extended session.
Rationale: Must be phone recruited. Typically not a problem if the study is run in the evening or on a weekend.

Step 3
Activity: Respondent shows up, and the chief interviewer orients the respondent as to the purpose of the study.
Rationale: Usually 25 people show up for the test. The orientation is important because it secures cooperation.

Step 4
Activity: Interviewer guides respondent through the first product.
Rationale: This is slow, because the interviewer guides the group through all the attributes. Respondents are unsure at the start, but have no problem going through the evaluation with an interviewer guiding the group. The products are randomized across respondents (one way to do this is sketched after this table).

Step 5
Activity: Respondent completes the first product, waits for the second product. Where appropriate, an interviewer checks the data.
Rationale: Neither the interviewer nor the respondent knows the "correct" ratings, but this checking maintains interest and motivation.

Step 6
Activity: Respondent waits a specified time and proceeds with the second product.
Rationale: Waiting time and order of products strictly controlled.

Step 7
Activity: Respondent finishes the evaluations, proceeds to an extensive classification.
Rationale: Data provide information on respondent attitudes, geo-demographics, purchase patterns, etc.

Step 8
Activity: Respondent paid and dismissed.
Rationale: Respondent maintains ongoing motivation and interest throughout the entire evaluation.
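As referenced in Step 4 of Table 5, product order is controlled across respondents. Below is a minimal sketch, in Python, of one common way to do this: a rotated (Latin square) schedule. The product labels and panel size are illustrative, and note that simple rotation balances serving position but not carryover effects between specific products.

    # Sketch of controlling product order across respondents (Table 5, Step 4).
    # A rotated (Latin square) schedule guarantees that every product appears
    # in every serving position equally often across the panel.

    def latin_square_orders(products, n_respondents):
        """Assign each respondent a rotation of the product list."""
        k = len(products)
        return [products[r % k:] + products[:r % k] for r in range(n_respondents)]

    products = ["A", "B", "C", "D"]
    for i, order in enumerate(latin_square_orders(products, 4), start=1):
        print(f"Respondent {i}: {order}")
    # Respondent 1: ['A', 'B', 'C', 'D']
    # Respondent 2: ['B', 'C', 'D', 'A'], and so on.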

Part 4: Analyses of Product Test Data

    Basic Data: Product x Attribute Profiles

In the past three decades, researchers have begun to recognize that there is much to be learned by obtaining profiles of multiple products on multiple attributes. These profiles, or "report cards," are analyzed in a variety of ways. Fundamentally they require the respondent to rate each product on different characteristics, with all of the products rated on the same characteristics, using the same scales. These report cards, or matrices, enable the researcher to glance quickly through the data set to determine the scores of a particular product on different attributes or, more importantly, the scores of many products on the same attribute. The ability to compare many products on a scorecard basis creates a deeper understanding of the product category, and is far easier to comprehend than a series of paired comparisons among all of the products. Respondents may rate the products on sensory, liking, or image attributes, or even on fit to different concepts. Table 6 shows an example of such data (a small sketch of this data structure follows the table).

    Table 6

    Example of product x attribute profile. All scales are rated on 0-100.

Sensory scales: 0 = none at all ... 100 = extreme
Liking scales: 0 = hate ... 100 = love

Product                     A    B    C

Appearance
Like Appearance            61   54   59
Brown                      57   43   62
Flecks                     59   30   73
Tomato Pieces/Amount       22   42   64
Tomato Pieces/Size         18   35   40
Vegetable Pieces/Size      11   15   38

Aroma/Flavor
Like Aroma                 45   44   57
Like Flavor                61   57   18
Flavor Strength            61   57   76
Tomato Flavor              55   59   49
Meat Flavor                13   25   56
Mushroom Flavor            16   24   36
Onion Flavor               25   20   36
Green Pepper Flavor        11    9   30
Vegetable Flavor           18   18   36
Herb Flavor                52   36   70
Black Pepper Flavor        19   14   36
Garlic Flavor              30   20   34
Cheese Flavor              11   32    5
Salt Taste                 25   24   23
Sweet Taste                37   28   19
Aftertaste                 48   44   70
Sour Taste                 26   36   54
Oily Flavor                26   23   24

Texture/Mouthfeel
Like Texture               58   63   60
Crisp Vegetable Texture    35   34   40
Oily Mouthfeel             29   24   25
Thickness                  53   41   47
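As a small illustration of how such a report card can be handled as a data structure, the sketch below loads a few rows of Table 6 into a pandas DataFrame (one tooling choice among many; any matrix representation works) and pulls out the two basic reads: one product across attributes, and one attribute across products.

    # Sketch of the product x attribute "report card" from Table 6 as a
    # data structure, using a few of its rows.
    import pandas as pd

    profile = pd.DataFrame(
        {"A": [61, 61, 55, 52], "B": [54, 57, 59, 36], "C": [59, 18, 49, 70]},
        index=["Like Appearance", "Like Flavor", "Tomato Flavor", "Herb Flavor"],
    )

    print(profile["B"])                  # one product's profile across attributes
    print(profile.loc["Like Flavor"])    # many products compared on one attribute
    print(profile.loc["Like Flavor"].idxmax())  # best-liked flavor: 'A'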

Drivers of Liking and the Inverted U-Shaped Rule for Products

At a higher level of analysis, these matrices enable the researcher to understand what characteristics drive acceptance (or drive other attributes). In days gone by, the researcher might simply correlate all of the attributes with overall liking to determine which particular attributes drive liking. The correlation statistic, assuming as it does a linear relation between two variables, often failed to show the importance of sensory attributes as drivers of liking. Rather, attributes such as "good quality" and "good taste" correlated with overall liking, which is somewhat of a tautology. One approach to understanding drivers of liking plots overall liking on the ordinate versus attribute liking on the abscissa. The equation that describes this relation is: Overall Liking = A + B (Attribute Liking). The slope, B, indicates the relative importance of the attribute. High values of the slope B mean that the attribute is important: unit increases in attribute liking correspond to large increases in overall liking. Conversely, low values of the slope B mean that the attribute is unimportant (Moskowitz and Krieger 1995).
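A minimal sketch of this slope analysis appears below; the six products' ratings are invented for illustration.

    # Sketch of the slope analysis: regress overall liking on each attribute
    # liking rating, and rank attributes by the slope B in
    # Overall Liking = A + B * (Attribute Liking). Ratings are invented.
    import numpy as np

    overall = np.array([62, 55, 70, 48, 66, 58])  # overall liking, six products

    attribute_liking = {
        "Like Flavor":     np.array([60, 50, 72, 45, 65, 55]),
        "Like Appearance": np.array([58, 60, 61, 57, 60, 59]),
    }

    for name, ratings in attribute_liking.items():
        B, A = np.polyfit(ratings, overall, 1)  # slope and intercept
        print(f"{name}: slope B = {B:.2f}")
    # A steep slope (here, Like Flavor) marks a driver of liking; a flat
    # slope (Like Appearance) marks an attribute that moves liking little.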

More recently, it has become the fashion to plot overall liking versus a sensory attribute and fit a curve to the data, as shown in Figure 1 (Moskowitz 1981). The curve, a quadratic function, shows that liking first increases with sensory attribute level, then peaks, and then drops down. The specific form of this inverted U-shaped curve will depend upon the specific product being tested, the sensory range achieved, and the preferences of the respondents. Sometimes the curve will be flat, showing that although respondents differentiate the products on a sensory attribute, the sensory differences do not drive liking (at least on a univariate basis). Sometimes the curve will show an inverted U shape; other times the curve will increase and perhaps flatten out slightly, as if it were reaching an asymptote. In other situations the curve will show a downward sloping relation, suggesting that as the sensory attribute increases, liking actually decreases. There is usually quite a pronounced relation between flavor/aroma sensory levels and liking, but less of a relation between texture or appearance and liking. We know from basic research that the chemical senses (taste, smell) provoke stronger hedonic reactions (like vs. dislike) than do the visual, auditory, and tactile senses. Thus, for consumer goods (whether food or health and beauty aids), we should not be surprised to see strong sensory-liking curves for the flavor attributes, but not for many of the appearance or texture attributes.

Figure 1
How flavor intensity drives overall liking. The researcher creates a scattergram, and fits a quadratic function to the data. The quadratic function shows where liking tends to peak, and reveals the degree to which liking co-varies with a single sensory attribute.
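The mechanics behind such a figure can be sketched in a few lines, under invented data: fit a quadratic of liking on sensory level, then locate the peak of the inverted U at -b/(2a).

    # Sketch of the quadratic fit behind Figure 1: regress overall liking on
    # flavor intensity (degree 2), then locate the sensory level where liking
    # peaks. The intensity and liking values are invented.
    import numpy as np

    flavor_intensity = np.array([20, 35, 45, 55, 65, 80])  # sensory ratings
    overall_liking = np.array([40, 55, 63, 66, 60, 45])    # hedonic ratings

    a, b, c = np.polyfit(flavor_intensity, overall_liking, 2)
    peak = -b / (2 * a)  # vertex of the inverted U (valid when a < 0)
    print(f"liking peaks near a flavor intensity of {peak:.0f}")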

    Sensory Preference Segmentation


Marketers are the first to recognize that consumers differ from each other, so it should come as no surprise that product test results are often reported for the total panel and for key subgroups. Traditionally, these key subgroups comprise individuals falling into different geo-demographic classifications or different brand usage patterns (e.g., users of brand X vs. brand Y; heavy vs. light users of brand Z; users vs. non-users). Some segmentations may also be based upon responses to a battery of questions that divide people into groups based upon pre-set classifications (Mitchell 1983). For the most part these segmentations, whether by brand, by geo-demographics, or by other overarching psychological schemata, do not show particularly striking differences in consumer responses to products. That is, in a product test these different subgroups of individuals show quite similar reactions to the products themselves.

Psychologists interested in the basic science of perception have introduced newer concepts of segmentation. The fundamental idea behind sensory-based segmentation begins with the previously suggested relation: as a sensory attribute increases, liking first increases, peaks (at some optimal sensory level), and then drops down with further increases in the sensory attribute (Moskowitz, Jacobs and Lazar 1985). It is important to keep in mind that the segmentation is based upon the sensory level at which a person's liking rating peaks, not upon the magnitude of liking. [A short digression is appropriate here. Segmentation based upon the pattern of liking ratings, rather than upon the relation between liking and a sensory attribute, could introduce artifacts: respondents showing similar sensory-liking patterns may fall into different segments if one respondent assigns high numbers to the products and the other assigns low numbers to the same products]. When applied to product test results (more than six products must be evaluated by each respondent), this segmentation, based upon the sensory-liking patterns, reveals that there are often two, three, or more groups in the population with distinct sensory preference patterns. That is, each segment shows a different sensory level at which liking peaks. There may be individuals who like strong flavored products, individuals who like weak flavored, lighter colored products, and the like (see Figure 2, which shows data for coffee).

Sensory preference segments transcend geo-demographics, brand usage, and values. Consumers in different segments may use the same brand, have the same values, etc. This organizing principle for product development, emerging from product test results (along with basic psychophysics), is forming the foundation of new approaches to consumer research and applied product development.
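A minimal sketch of the segmentation logic follows: fit a quadratic liking curve per respondent, extract the sensory level where each person's liking peaks, and group people by that peak rather than by rating magnitude. The data and the crude two-group split are invented; a real study would cluster (e.g., k-means) on the peaks or fitted curves.

    # Sketch of sensory preference segmentation: group respondents by where
    # their liking peaks on the sensory continuum, not by how high they rate.
    import numpy as np

    sensory = np.array([20, 35, 45, 55, 65, 80])  # sensory levels of 6 products

    # Rows = respondents' liking ratings of the same six products.
    ratings = np.array([
        [60, 66, 64, 55, 45, 30],   # likes weak-flavored products
        [30, 42, 55, 62, 66, 58],   # likes strong-flavored products
        [55, 63, 61, 52, 41, 28],   # weak-flavor preference, lower scale use
    ])

    peaks = []
    for row in ratings:
        a, b, c = np.polyfit(sensory, row, 2)
        peaks.append(-b / (2 * a))  # sensory level where this person's liking peaks

    # Crude one-dimensional split on peak location. Note that respondents 1
    # and 3 land in the same segment despite using the scale differently.
    threshold = np.median(peaks)
    segments = ["weak-flavor segment" if p <= threshold else "strong-flavor segment"
                for p in peaks]
    print([f"{p:.0f}" for p in peaks], segments)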


Figure 2
Sensory segmentation results from a five-country study of coffee. The left panel shows how overall liking for respondents from each of five countries (1-5) is driven by perceived bitterness. The right panel shows the same liking-bitterness relation, this time using data from sensory segments obtained from the five countries. Note that each segment is represented in each country, albeit in different proportions.

Product Mapping and Creating a Product Model from the Map

At an even higher level of analysis, the matrix of product x attribute ratings is used to map products into a geometrical space (see Figure 3). The geometrical space may comprise one, two, three, or many more dimensions (although it is difficult to visualize a space of more than three dimensions). Product mapping is typically done with "disconnected" products, usually those currently in the product category, but occasionally with prototypes (so-called category appraisal, aptly named because the approach considers the full range of products currently in a category, without regard to the connectedness of their underlying formulations; Munoz, Chambers and Hummer 1996).

The locations of the products in the map are usually based upon the sensory properties of the products, with products located close together being sensorially similar to each other. Brand managers and product developers use these brand maps as heuristics to understand how the consumer perceives the products as a set of interchangeable (or non-interchangeable) items.
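As an illustration of the mapping step, the sketch below reduces an invented product x sensory-attribute matrix to two dimensions with principal components analysis (one of several workable techniques; multidimensional scaling is another), so that sensorially similar products land near each other.

    # Sketch of product mapping: reduce the product x sensory-attribute
    # matrix to a two-dimensional map, so that products with similar sensory
    # profiles plot close together. Ratings are invented.
    import numpy as np
    from sklearn.decomposition import PCA

    # Rows = products, columns = mean sensory ratings (e.g., brown, flecks,
    # tomato flavor, herb flavor), echoing the structure of Table 6.
    sensory_profiles = np.array([
        [57, 59, 55, 52],   # product A
        [43, 30, 59, 36],   # product B
        [62, 73, 49, 70],   # product C
        [45, 33, 57, 38],   # product D (similar to B, so it should map nearby)
    ])

    coords = PCA(n_components=2).fit_transform(sensory_profiles)
    for name, (x, y) in zip("ABCD", coords):
        print(f"product {name}: ({x:6.1f}, {y:6.1f})")
    # Products B and D should plot close together; C should sit far away.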


    When the coordinates of the b