Sawtooth Software: The Survey Software of Choice

Conference 2004 Summary of Findings

Nineteen presentations were delivered at the eleventh Sawtooth Software Conference, held October 6-8, 2004, in San Diego, CA. We've summarized some of the high points below. Since we cannot possibly convey the full worth of the papers in a few paragraphs, the complete written papers are available in the 2004 Sawtooth Software Conference Proceedings, which can be ordered from Sawtooth Software.

It's Ethical Jim, But Not in the Way We Used to Know It! (Ray Poynter, UK):
Ray asserted that there is a near-global shift in attitudes regarding people in positions of authority, expert opinion, and research (whether dealing with "hard" or social sciences). People are much more skeptical today. At a broader level, these changes are due to very visible cases of personal/corporate corruption, citizen cynicism, lack of respect for public leaders, and the ubiquitous spin that is self-serving and often counterfactual. Within our own industry, there is an increase in direct marketing, falling response rates, an overall increase in polling, and a perception of convergence of direct marketing and market research.

Ray questioned how good the data are when we pursue some respondents relentlessly, after they have already refused multiple times to participate. He condemned improper uses of marketing research, such as SUGGING (selling under the guise of research) and push-polling. Legislation is increasing that may limit our ability to conduct research among the general population. As a result, the use of permission-based panels will likely increase. However, Ray questioned how ethical it is to report a 60% response rate for a panel survey when the panel includes fewer than 1% of the population. Ray suggested we spend more resources training our staff in professional standards and ethical conduct. He stressed that we should consider the consequences of our research, and whether each project we undertake truly benefits respondents/consumers. And we shouldn't necessarily "do unto others as you would have them do unto you," because their preferences may not be the same as ours!

A Structured Approach to Choosing and Using Web Samples (Theo Downes-Le Guin, Doxus):
Theo pointed out that the use of online panels will continue to grow as the representativeness of phone-based sampling deteriorates. Web-based research is increasing, and timelines for projects continue to shorten. He suggested that many researchers no longer distinguish between sampling and data collection mode decisions--choosing a web panel is a simultaneous choice of sample source and mode. As a result, the "traditional" process of balancing survey costs and errors as a basis for sampling decisions is compressed or eliminated. Theo reviewed the pros and cons of probability-based vs. non-probability-based sampling. He also spoke of snowball (referral) sampling as a way to define a frame for hard-to-reach populations.

Theo suggested that we estimate the likely sources of error, and use this information to decide, separately, regarding sample source and data collection mode. We should find the appropriate balance among manageability, cost efficiency, and reduction of error. Practical questions to ask include: "If a frame-based approach is available, does it offer a compelling advantage?" "If an intercept-based approach is the only alternative, have we made the risks/pitfalls known to our clients?" "Can we combine more than one sampling approach, and then compare the results to guide future decisions?"

Optimizing the Online Environment: Examination of a Configurator Analysis Case Study (Donna Wydra, Socratic Technologies, Inc. and Debra Kassarjian, Taco Bell):
The authors demonstrated how Taco Bell is trying to use the voice of the customer earlier than before in product development and evaluation. Specifically, they presented a case study involving a line extension within burritos. Traditionally, item sorts and TURF analysis had been employed. This approach had resulted in relatively flat measures (low dispersion), where it was difficult to pick a winner. And, this type of analysis focused on a single product, rather than investigating a greater variety of options.

The authors demonstrated a different technique that they felt improved upon the previous lines of research: virtual product configuration. The survey was computer-based, with an attractive Flash-based exercise wherein respondents built their optimal burrito. Respondents could choose ingredients and (with a variety of stimulating visual effects) watch the burrito being built before their eyes. After they built their ideal burrito, respondents were asked Van Westendorp pricing questions to investigate a reasonable range of price expectations. Cluster analysis revealed groups of individuals who chose similar ingredients. Even though the virtual configurator (like the previous research) had limitations in terms of forecasting demand for new product concepts, the authors felt that the results were more insightful than previous research.

The Options Pricing Model: A Pricing Application of Best-Worst Measurement * (Keith Chrzan, Maritz Research): Best/Worst (or MaxDiff) scaling is a relatively new technique put forward by Jordan Louviere and colleagues in the early 1990s. Keith showed how this technique could be applied to estimate the relative demand and price sensitivity for automobile options (such as a sunroof, anti-lock brakes, etc.). Best/Worst questionnaires generally show a subset of the possible options, and ask respondents to indicate which is the best within the set and which is the worst. Eleven automobile features were tested at four price points each, among 202 consumers.

Keith analyzed the data using HB analysis, which yielded individual-level utility parameters (on a common scale) for each of the options at each price point. This permits the researcher to forecast which options (and at which price levels) would be most accepted in the marketplace. Utilities were converted to relative probabilities of choice by taking the antilog (exponentiating). Correlation between model prediction and self-reported actual purchase of options on the most recently purchased vehicle was 0.92. This suggested that the client could use the best/worst data to reasonably project future sales for not-yet-offered options and determine appropriate prices for each.

(*Winner of Best Presentation award, based on attendee ballots.)
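
As a rough illustration of the exponentiation step Keith described, the short Python sketch below converts a few logit-scaled utilities into relative probabilities of choice. The option names and utility values are hypothetical, not estimates from the study.

  import numpy as np

  # Hypothetical logit-scaled utilities for a few option/price combinations
  # (illustrative values only, not estimates from the study).
  utilities = {"sunroof at $800": 0.9,
               "sunroof at $1,200": 0.2,
               "anti-lock brakes at $400": 1.4}

  # Exponentiate (take the antilog) and normalize to obtain relative
  # probabilities of choice.
  exp_u = {name: np.exp(u) for name, u in utilities.items()}
  total = sum(exp_u.values())
  for name, e in exp_u.items():
      print(f"{name}: {e / total:.1%}")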

Conjoint Analysis: How We Got Here and Where We Are: An Update (Joel Huber, Duke University):
Joel traced the history of conjoint analysis, from its roots in rigorous psychometric theory, to its eventual practical application in market research. Initially, conjoint measurement was proposed as a method to develop interval-scaling of multiple items from ranked observations. Early researchers applied conjoint measurement with human respondents, but found that many of the axiomatic tests (such as independence) were consistently violated. Even so, practitioners saw that the new measurement technique offered significant value, especially due to market simulation capabilities.

Practitioners replaced the rankings (early card-sort conjoint) with ratings (and later, choices), and replaced the full factorial designs with highly fractionated ones. Joel suggests that conjoint analysis has worked well because the simplification that respondents do in the conjoint survey often mirrors choice in the marketplace. Conjoint reflects how individuals might choose, given full information and more experience in making choices.

Choice has dominated ratings-based conjoint lately due to a number of factors. It is argued to better relate to market behavior, it emphasizes the competitive context, it is better for dealing with price/cost, and people are willing/able to make choices about just about anything. Surprises over the years have been the power of market simulators to account for differential substitution, the success of HB in predicting individual choices, and the difficulties of finding adaptive designs that outperform orthogonal ones.

The "Importance" Question in ACA: Can It Be Omitted? (Chris King, Aaron Hill, and Bryan Orme, Sawtooth Software):
For all the success of ACA (Adaptive Conjoint Analysis), a potential weakness has been the "self-explicated importance question" used in ACA's "priors." Sometimes, self-explicated importance questions are difficult for respondents to understand, and the responses typically show limited discrimination, with many ratings grouping at the high point of the scale. Historically, importance questions were needed to estimate final utility scores under OLS, for potentially dozens of attributes. However, with the availability of HB estimation, the importance questions might not be needed to stabilize individual-level ACA parameters.

The authors conducted a test to see if the importance questions could be skipped. Respondents were randomly assigned to receive a version of ACA that either included or didn't include the Importance section. Respondents who didn't receive the Importance section completed an extra six Pairs questions, though their total interview time was still over a minute shorter than for respondents who completed traditional ACA. The authors found that traditional ACA (with the Importance question) achieved slightly better hit rates for holdout choice tasks, but share predictions were better when importances were omitted (in favor of six more Pairs). Also, the final utilities estimated without the importance questions showed more discrimination among attributes ("steeper" importances) and provided different information than the utilities that used the self-explicated importance information. The authors concluded that, when using HB analysis, the Importance question could be omitted.

Scale Development with MaxDiffs: A Case Study (Luiz Sa Lucas, IDS-Interactive Data Systems):
Luiz investigated how well MaxDiff scaling questionnaires could be used to quantify the "seriousness of offenses" (crimes such as stealing, committing violence, or even murder). A series of studies already available in the literature (from the 1960s and 1970s) used a questionnaire that asked respondents to answer using a ratio scale, where a "10" was assigned to the seriousness of bicycle theft. If an offense was perceived as 10 times as bad as a bicycle theft, respondents were instructed to assign 100 points, and so on. Using the data, researchers had developed a "power rule" that related the ratio of the seriousness of an offense to the magnitude of that offense.

Luiz found an excellent match between the scales he developed from MaxDiff, and the "power rule" relationships found in the previous studies. Luiz also investigated the use of latent class to develop segments of individuals, based on their MaxDiff judgments of the seriousness of offenses. He also collected attitudinal statements to classify people in different psychological segments, as suggested by previous research. The segmentation developed from MaxDiff seemed consistent with that in the literature. Luiz's case study lends additional credibility to the claim that MaxDiff results are both ratio scaled (after exponentiating the logit-scaled coefficients) and reliable. And, a MaxDiff questionnaire is probably easier for respondents to complete than the questionnaires employed by the previous authors.
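
For readers unfamiliar with the exponentiation step, here is a minimal Python sketch of how logit-scaled MaxDiff coefficients might be converted to a ratio scale and re-based so that bicycle theft scores 10, mirroring the earlier studies. The offenses and coefficients are invented for illustration.

  import numpy as np

  # Invented logit-scaled MaxDiff coefficients (illustration only).
  logit_coefs = {"bicycle theft": 0.0, "burglary": 1.1, "armed robbery": 2.3}

  # Exponentiating yields ratio-scaled scores; re-base so bicycle theft = 10,
  # as in the earlier magnitude-estimation studies.
  ratio = {offense: np.exp(b) for offense, b in logit_coefs.items()}
  base = ratio["bicycle theft"]
  for offense, score in ratio.items():
      print(f"{offense}: {10 * score / base:.1f}")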

Multicollinearity in CSAT Studies (Jane Tang & Jay Weiner, Ipsos-Insight):
Jane and Jay began their presentation by demonstrating how multicollinearity fouls our ability to derive stable betas under multiple regression. With synthetic data, correlated independent variables lead to unstable estimates of, say, drivers of customer satisfaction. The authors turned their discussion to a family of analytical methods that do a better job dealing with multicollinearity in CSAT studies: Kruskal's Relative Importance, Shapley Value Regression, and Penalty & Reward Analysis. These methods investigate all possible combinations of independent variables, and derive the contribution to the model of each variable, measured by the difference in some measure of fit when the variable is included vs. not included.
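
To make the all-possible-combinations idea concrete, the short Python sketch below computes Shapley-style driver importances as each driver's average marginal contribution to R-squared over every order in which the drivers can enter the model. The data are simulated (with deliberately correlated drivers); this illustrates the principle, not the authors' implementation.

  from itertools import permutations
  import numpy as np

  # Simulated CSAT-style data with deliberately correlated drivers.
  rng = np.random.default_rng(0)
  n = 300
  drivers = ["speed", "courtesy", "accuracy"]
  X = rng.normal(size=(n, 3))
  X[:, 1] += 0.8 * X[:, 0]                      # induce multicollinearity
  y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(size=n)

  def r_squared(cols):
      """R-squared of an OLS regression of y on the listed driver columns."""
      if not cols:
          return 0.0
      A = np.column_stack([np.ones(n), X[:, cols]])
      resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
      return 1.0 - resid.var() / y.var()

  # Shapley-style importance: average marginal gain in R-squared over all
  # orderings in which the drivers can enter the model.
  importance = {d: 0.0 for d in drivers}
  orderings = list(permutations(range(len(drivers))))
  for order in orderings:
      included = []
      for j in order:
          importance[drivers[j]] += r_squared(included + [j]) - r_squared(included)
          included.append(j)
  importance = {d: gain / len(orderings) for d, gain in importance.items()}
  print(importance)   # the contributions sum to the full-model R-squared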

For both simulated and real CSAT data cases, the authors analyzed bootstrap samples to demonstrate that these methods result in much more stable estimates of drivers of satisfaction than standard OLS or stepwise OLS. They concluded that when the objective is to establish relative importance, rather than to forecast the dependent variable, methods that take into consideration all possible combinations of the explanatory variables are much more robust. Moreover, the margin of victory for these techniques increased as sample size decreased.

Insights into Patient Treatment Preferences Using ACA (Liana Fraenkel, Yale School of Medicine, and Dick Wittink, Yale School of Management):
In contrast to most applications of conjoint analysis, which focus on marketing-related issues, Liana and Dick showed how ACA can be used to allow physicians to gauge the perspectives of individual patients who suffer from chronic diseases such as arthritis, asthma, and diabetes. Doctors have an ethical obligation to respect patient autonomy and a legal obligation to obtain informed consent. By interviewing patients using ACA and considering their preferences with respect to available treatment options, patients diagnosed with a specific chronic disease have an opportunity to learn which of the medically feasible options is most suitable for them at a given time.

The use of conjoint allows patients to express individual tradeoffs between efficacy, side effects, mode of treatment and costs. This avoids the limitation that physicians lack the time and the training to capture each individual patient's unique perspectives. A field experiment is in process to show whether the use of ACA changes prescription decisions, enhances patient satisfaction, improves health outcomes, and reduces total cost of care.

Modeling Conceptually Complex Services: The Application of Discrete Choice Conjoint, Latent Class, and Hierarchical Bayes to Medical School Curriculum Redesign (Charles E. Cunningham, Ken Deal, Alan Neville, and Heather Miller, McMaster University):
McMaster University recently conducted a study to determine how to improve the quality of its medical school program. Due to increased class sizes, continuing to offer small tutorial group sizes of 5 students was becoming more expensive. The authors designed a CBC study to investigate other enhancements to the curriculum that would result in greater student satisfaction overall, and compensate for planned increases in tutorial group size.

Using qualitative research, the authors developed a list of attributes important to the quality of the program. Fourteen attributes, each with four levels, were used in a web-based, partial-profile CBC interview. The authors employed Latent Class to investigate the preferences of segments of students. Two segments emerged: students whose preferences better aligned with McMaster's small-group, problem-based tutorial curriculum, and a smaller group of students who seemed to favor a more traditional medical school program. Based on market simulations, the authors made specific recommendations regarding lower-cost (but significantly preferred) options that could improve the existing curriculum, despite an increase in tutorial group size. The results also underscored the need to identify, during the admissions process, prospective students who are a better fit with McMaster's curriculum.

Over-Estimation in Market Simulations: An Explanation and Solution to the Problem with Particular Reference to the Pharmaceutical Industry (Adrian Vickers, Phil Mellor, and Roger Brice, Adelphi International Research, UK):
Conjoint analysis is often used in pharmaceutical research. Due to the nature of product development in pharma, there is a strong demand for conjoint methods to deliver share predictions, in addition to utility estimates, sometimes three to five years prior to launch. The authors indicated that they typically see overestimation of share for new product entries using conjoint analysis; predicted shares can often be 2x to 3x what market knowledge would suggest.

The authors described their typical questionnaires, including how they strive to establish a realistic setting (asking the physician to consider a specific patient when completing choice tasks), the use of partial profiles when there are more than six attributes, and HB estimation to obtain individual-level estimates. In addition to those standard procedures, they apply a cut-off threshold at the individual level, as well as external effects, also applied at the individual level, to reflect other market realities for releasing a new drug. The full model takes into account any reluctance the physician may have about prescribing a new drug, and also the volume restriction that results from third-party payers and/or a physician's own consideration of what is a "fair" allocation among available product options. The amount of reduction depends on such things as how serious and common the condition is, and the competitiveness of the market. The model still assumes 100% awareness of the new product, but the authors believe it represents an important step closer to predicting a realistic market share.
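
A minimal Python sketch of that style of individual-level adjustment follows; the utilities, cut-offs, and external-effect factors are invented, and the logic is deliberately simplified relative to the authors' full model.

  import numpy as np

  # Invented individual-level utilities: columns are the new drug and two
  # existing competitors; rows are three physicians.
  utilities = np.array([[1.2, 0.4, 0.1],
                        [0.3, 1.0, 0.8],
                        [2.0, 0.2, 0.5]])
  cutoffs = np.array([0.20, 0.35, 0.25])   # minimum share needed to "adopt"
  external = np.array([0.60, 0.80, 0.70])  # individual volume restrictions

  # Standard logit shares of preference for each physician.
  exp_u = np.exp(utilities)
  shares = exp_u / exp_u.sum(axis=1, keepdims=True)

  # Apply the cut-off threshold and external effect at the individual level.
  new_drug = shares[:, 0]
  new_drug = np.where(new_drug >= cutoffs, new_drug, 0.0)
  new_drug = new_drug * external

  print("Adjusted share for the new drug:", round(new_drug.mean(), 3))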

Estimating Preferences for Product Bundles vs. a la carte Choices (David Bakken and Megan Kaiser Bond, Harris Interactive):
Bundling is a common pricing strategy that under many conditions can yield greater overall revenues than selling the components separately (a la carte). Common examples include "value meals" at fast-food restaurants. "Mixed bundling" involves offering buyers a choice between buying a bundle and buying the components a la carte. However, standard choice-based conjoint approaches do not capture the choice process for mixed bundling strategies.

Based on a real client need, David and Megan designed a more complex CBC-like choice task that incorporated the notion of mixed bundling. Respondents (web-based survey) were instructed that they could either purchase the components as a bundle from a single manufacturer, or they could purchase the components separately from different manufacturers. Such a complex task required careful questionnaire design, pre-test, and detailed explanations and examples. The key to their solution was to develop two types of models from the data: a set of part worths predicting purchases of the bundles (including a coefficient for non-purchase of bundle), and a set of part worths predicting selections of each a la carte item given rejection of the bundle. The authors built a spreadsheet simulator that included a simple "if" condition to determine whether the shares of preference for each individual would be captured by the bundled alternatives or the a la carte alternatives.
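
The Python sketch below illustrates that two-model-plus-"if" logic for a single respondent; the utilities and the 50% acceptance rule are invented for illustration and are not the authors' actual simulator.

  import numpy as np

  def logit_shares(utils):
      """Convert a list of utilities to shares of preference."""
      e = np.exp(np.asarray(utils, dtype=float))
      return e / e.sum()

  # Invented part worth totals for one respondent.
  bundle_utils = {"bundle A": 1.4, "bundle B": 0.9, "no bundle": 1.1}
  alacarte_utils = {"component X": 0.7, "component Y": 0.5, "none": 0.2}

  bundle_shares = dict(zip(bundle_utils, logit_shares(list(bundle_utils.values()))))

  # Simple "if" condition: if the respondent is predicted to accept a bundle,
  # shares come from the bundle model; otherwise from the a la carte model.
  if bundle_shares["no bundle"] < 0.5:
      result = {k: v for k, v in bundle_shares.items() if k != "no bundle"}
  else:
      result = dict(zip(alacarte_utils, logit_shares(list(alacarte_utils.values()))))

  print(result)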

The Importance of Shelf Presentation in Choice-Based Conjoint Studies (Greg Rogers, Procter and Gamble, and Tim Renken, Coulter/Renken):
The grid-based approach and the shelf display are two common layouts for studying packaged goods using CBC. In the grid-based approach, a subset (often 9 to 12) of the SKUs under study (shown as graphics or text only) is displayed on the screen. With the shelf-based display, all SKUs (often 20 to 30) under study are shown in each choice task, represented graphically on rows that look very much like store shelves. The shelf display seems to more accurately reflect the purchase experience, both in terms of available products and the general look of the environment. Greg and Tim hypothesized that if the display of products in the choice task were more like a store shelf layout, respondents would behave more like they really do when they shop.

The authors fielded a CBC study using the different layout approaches in CBC, covering multiple product categories. The criteria for success were: 1) ability to capture the same price elasticity estimates as are obtained using an IRI Marketing Mix Model (regression-based model using actual scanner sales data), and 2) ability to predict actual market shares. They found that the shelf layout provided slightly better fit to econometric models of price sensitivity, but that the grid layout provided slightly better fit to share. Importantly, the authors found that the estimates of price sensitivity from CBC were on average unbiased with respect to the scanner data models, though they often missed by a significant margin when any one product or brand was considered. They concluded that both exercises yield similar results, though the shelf display seems to have greater face validity, and is therefore easier to sell to clients.

The Effect of Design Decisions on Business Decision-Making (Curtis Frazier and Urszula Jones, Millward Brown-IntelliQuest):
There has been a significant amount of research done probing the strengths of different flavors of discrete choice and methods for estimating part worths. Most studies have focused on hit rates and prediction accuracy for holdout shares. Curtis and Urszula focused their research on how different design decisions for discrete choice studies might affect the outcome in terms of business decision-making. The authors fielded a discrete choice study among 2400 respondents to a web-based survey. Experimental treatments were: full profile vs. partial profile; and no "None" alternative vs. including a "None" alternative vs. using a follow-up "None" question to each task. The research spanned three different product categories: Digital Cameras, HDTV and MP3 Players.

The authors found that partial profile tended to dampen the relative importance of price and increase the importance of brand, relative to full profile. Including a "None" concept (either within the task, or as a second-stage question following the choice task) tended to increase the relative importance of price. To test how these differences might affect business decisions, the authors created hundreds of potential simulation scenarios, and used the part worths from the various treatments to determine "optimal price" points to maximize revenue for a client's hypothetical offering. As suggested by the findings regarding "importances," partial-profile designs led to significantly higher optimal price points. Including a "None" option in the questionnaire yielded the lowest derived optimal price points. Also, asking the "None" as a separate follow-up question produced much higher overall "None" usage than when "None" was included in the choice task. Curtis and Urszula hypothesized that when "None" is included in the choice task, respondents may wish to appear cooperative by avoiding use of the "None." The authors concluded that different design decisions often have a modest effect on holdout hit rates and share prediction accuracy, but can have a much bigger impact on business decisions, such as finding the right price points and projecting overall demand by relying on the scaling of the "None."
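
As a simplified illustration of such an "optimal price" search, the Python sketch below simulates share for a client offering at several candidate prices against a fixed competitive set and picks the price that maximizes expected revenue. All part worths and prices are hypothetical.

  import numpy as np

  # Hypothetical part worths for the client's offering at each tested price,
  # plus a fixed competitive set and a "None" alternative.
  price_utils = {199: 0.8, 249: 0.5, 299: 0.2, 349: -0.2, 399: -0.6}
  client_base = 1.0
  competitors = [1.1, 0.9]
  none_util = 0.0

  best_price, best_revenue = None, -1.0
  for price, u_price in price_utils.items():
      utils = np.array([client_base + u_price] + competitors + [none_util])
      share = np.exp(utils[0]) / np.exp(utils).sum()
      revenue = price * share
      if revenue > best_revenue:
          best_price, best_revenue = price, revenue

  print("Revenue-maximizing price:", best_price)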

Application of Latent Class Models to Food Product Development: A Case Study (Richard Popper, Jeff Kroll, Peryam & Kroll Research Corporation, and Jay Magidson, Statistical Innovations):
Food manufacturers need to understand the taste preferences of their target consumers, but taste preferences are often not homogeneous. Preference segments exist, and recognizing these differences may lead to better products that appeal to different segments and increase overall sales. The authors studied crackers using 18 flavor attributes, 20 texture attributes, and 14 appearance attributes. A trained sensory panel of 8 individuals rated the fifteen test crackers on the various attributes using 15-point intensity scales, and the average ratings from the sensory panel were used as independent variables. 157 respondents rated all 15 crackers (the dependent variable) over a period of three days, using a completely randomized block design balanced for the effects of day, serving position, and carry-over.

Different models were used to detect consumer segments according to their liking ratings for the crackers. Four main models were tested: Latent Class (LC) Cluster model (nominal factor), LC Factor model (discrete factors), LC Regression model with a random intercept (nominal factor + one continuous factor) and a parsimonious non-LC regression model (two continuous factors). Latent GOLD software was used. The authors concluded that there was clear evidence of segment differences in consumers' liking ratings. Respondents reacted similarly to the variations in flavor and texture, but differed with regard to how they reacted to the products' appearance. Many other details regarding the relative strengths of the different models are covered in the full written paper.

Assessing the Impact of Emotional Connections (Paul Curran, Greg Heist, Wai-Kwan Li, Camille Nicita, Bill Thomas, Gongos and Associates, Inc.):
The authors suggested that more advertisers these days rely on forging emotional connections with their audience rather than relying principally on value propositions. The feeling is that in some markets, emotional connections are greater drivers of purchase and loyalty than other (for example, utilitarian) influences. The challenge faced by the researchers was how to quantify consumers' emotional connections with specific brands of automobiles. The authors conducted a great deal of qualitative research prior to designing the final quantitative instrument. The qualitative stage involved asking respondents to bring to the interview a collage of images and/or words that illustrated their feelings toward an ideal brand/product. Based on the preliminary qualitative effort, the researchers assembled characteristic images and words into vignettes intended to capture specific emotional values. These stimuli were then used in a discrete choice exercise.

The authors felt that a key aspect of the research was a "negative priming" exercise. Based on previous research in psychology, the idea was to mentally overload and distract respondents in order to hinder overly conscious thinking. After the negative priming exercise, respondents completed a discrete choice task involving eight emotional drivers, measured on two dimensions for each vehicle: "How do you want to feel about a vehicle?" (Importance) and "Which do you associate with the [brand]?" (Brand Association). The authors examined the data by respondent segments based on automobile ownership. The most important emotional drivers were "peace of mind" and "smart and practical." A perceptual map (using correspondence analysis) displayed the results of the brand associations. VW was strongest on "fun to own" and "happy and carefree," Honda owned "peace of mind," and Saturn was strongly associated with "care for others." The authors concluded that "independent or self-reliant" represented a positioning opportunity that none of the vehicles measured currently fills.

Item Response Theory (IRT) Models: Basics, and Marketing Applications (Lynd Bacon & Jean Durall, LBA Ltd., and Peter Lenk, University of Michigan):
Item Response Theory (IRT) models, also called latent trait models, originated in the educational testing literature to jointly measure subjects' latent traits or abilities and test item difficulty. Lynd, Jean, and Peter described the simplest IRT model, the Rasch model, which assumes that a subject's performance on a test is determined by his or her latent trait and the difficulty of the test items. The latent trait is a random effect that varies across subjects, and item difficulty is a fixed effect for each item. The authors pointed out that the Rasch model is strongly related to CBC hierarchical Bayes models, which are derived from random utility models (RUM).

The authors suggested that IRT provides a rich framework for test item construction, which may have potential in marketing research. Test items can be characterized by their two parameters: discrimination and difficulty. Both of these parameters are combined in the item information function (IIF), which summarizes how much information an item has in estimating the latent trait for different values of the latent trait. One can easily imagine developing a large bank of marketing research items for different concepts, such as loyalty and satisfaction, where the items are indexed by their IIF. A marketing researcher could then select items from these banks to construct survey instruments for various purposes, such as studying highly loyal customers or dissatisfied customers. IRT enables the design of adaptive, online surveys. After obtaining an initial estimate of a subject's latent trait, an adaptive survey might select items to better estimate the trait with fewer responses. Instead of using a "shot-gun" approach to survey design, marketing researchers could be more strategic and systematic by employing the IRT framework.
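
For the curious, here is a minimal Python sketch of the two-parameter logistic IRT model and its item information function; the Rasch model is the special case with discrimination fixed at 1. The parameter values are hypothetical.

  import numpy as np

  def prob(theta, a, b):
      """2PL response probability: discrimination a, difficulty b."""
      return 1.0 / (1.0 + np.exp(-a * (theta - b)))

  def item_information(theta, a, b):
      """Item information function for the 2PL model: a^2 * P * (1 - P)."""
      p = prob(theta, a, b)
      return a ** 2 * p * (1.0 - p)

  # Two hypothetical items evaluated across a range of latent trait values.
  thetas = np.linspace(-3, 3, 7)
  for a, b in [(1.0, 0.0), (2.0, 1.0)]:
      print(f"a={a}, b={b}:", np.round(item_information(thetas, a, b), 3))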

Avoiding IIA Meltdown: Choice Modeling with Many Alternatives (Greg Allenby, Ohio State University and Jeff Brazell, The Modellers, LLC):
IIA (Independence from Irrelevant Alternatives) is a property of logit models that can sometimes pose problems for market simulations, especially when many choice alternatives are being studied. Greg and Jeff investigated a new model using data from a discrete choice project involving over 1000 different automobile concepts. In typical choice modeling, a unique error term is associated with each alternative, and when the choice set is large, adding a near-duplicate offering results in nearly a doubling of its predicted share. Greg and Jeff's solution involved restricting the error space (the number of unique error terms) by assigning the same error realization to choice alternatives that share common important attributes (e.g., brand name). This avoids the "proportional draw" property of the logit model for these alternatives, resulting in a model with more reasonable predictive properties.
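
A quick numeric illustration of the proportional-draw problem (with invented utilities): in a standard logit model with a large choice set, adding a near-duplicate of one alternative nearly doubles that alternative's combined predicted share.

  import numpy as np

  def shares(utils):
      """Standard logit shares of preference."""
      e = np.exp(np.asarray(utils, dtype=float))
      return e / e.sum()

  utils = [0.0] * 20                 # 20 comparable alternatives (invented)
  print("Share of alternative 1:", round(shares(utils)[0], 4))

  utils_dup = utils + [0.0]          # add a near-duplicate of alternative 1
  s = shares(utils_dup)
  print("Alternative 1 plus its duplicate:", round(s[0] + s[-1], 4))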

Predictive accuracy of actual market shares for automobiles showed a small improvement for the error-restricted models relative to traditional error specifications, despite the restrictive assumptions made about the error terms. The authors also illustrated their solution using a packaged-goods problem.

A Second Test of Adaptive Choice-Based Conjoint Analysis (The Surprising Robustness of Standard CBC Designs) (Rich Johnson, Sawtooth Software, Joel Huber, Duke University, and Bryan Orme, Sawtooth Software):
This presentation featured a second trial of a new adaptive design method for choice-based conjoint questionnaires. In a previous conference, Rich and his co-authors reported improved predictions of holdout choice shares (compared to standard CBC) when using customized (adaptive) experimental designs. The adaptive algorithm creates a unique design for each respondent, where the questions are chosen to maximize D-efficiency. Because part worths affect D-efficiency (utility balance yields greater statistical efficiency), preliminary estimates of part worths are needed for each respondent. In previous research, ACA-like self-explicated priors were used. In the current research, the authors dropped the self-explicated attribute importances and used only within-attribute level ratings. They also investigated both full-profile and partial-profile questionnaire formats.
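
As a rough sketch of the quantity the adaptive algorithm maximizes, the Python snippet below computes the D-error of a small choice design given prior part worths (lower D-error means higher D-efficiency). The design matrices and priors are hypothetical.

  import numpy as np

  def d_error(tasks, beta):
      """D-error of a choice design under prior part worths beta.

      Each task is an (alternatives x parameters) design matrix; lower
      D-error means a more statistically efficient design."""
      K = len(beta)
      info = np.zeros((K, K))
      for Z in tasks:
          u = Z @ beta
          p = np.exp(u) / np.exp(u).sum()
          info += Z.T @ (np.diag(p) - np.outer(p, p)) @ Z
      return np.linalg.det(info) ** (-1.0 / K)

  beta = np.array([0.5, -0.3])       # hypothetical prior estimates
  tasks = [np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]),
           np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])]
  print("D-error of this design:", round(d_error(tasks, beta), 4))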

1009 respondents completed a web-based study, randomly receiving one of four questionnaires (ACBC vs. CBC crossed by full profile vs. partial-profile). The hit rates and share predictions for the ACBC vs. CBC treatment were similar. However, holdout predictions were better for full profile than partial profile when predicting choices of full-profile holdouts, probably due in part to methods bias. The authors were puzzled why ACBC didn't perform better. Upon further investigation, they discovered that the adaptive designs were indeed about twice as efficient with respect to the information they were given (initial self-explicated part worths). However, the self-explicated information used to generate the designs was not accurate enough to produce efficient designs with respect to the final CBC-derived part worths. Omitting the "importance" question was apparently a mistake. The authors suggested that rather than depending on self-explicated information, a better method in future research might include using part worth information from previous respondents (through HB or Latent Class) as initial estimates, and updating those estimates after each respondent answer.