
Conference 2006 Summary of Findings

The twelfth Sawtooth Software Conference was held in Delray Beach, Florida, March 29-31, 2006. With 170 attendees, it was the best-attended conference we have held in recent memory.

Summary of Findings

The summaries below capture some of the main points of the presentations. Since we cannot possibly convey the full worth of the papers in so few words, the authors have submitted complete written papers for the 2006 Sawtooth Software Conference Proceedings, available for ordering from Sawtooth Software. The printed book costs $75 (and includes a PDF copy of the proceedings on CD); the PDF document alone (on CD) costs $25.

Putting the Ghost Back in the Machine (Andrew Jeavons, Nebu USA): In this presentation, Andrew emphasized how the level of human interaction with market research surveys has changed as we've progressed from paper-based instruments, to phone, to CAPI, and now to web-based surveys. Human interviewers introduce certain biases, but also certain benefits (such as the ability to keep respondents from terminating prematurely). Andrew also reported results from a study that segmented respondents into introverts and extroverts. Respondents saw one of two versions of the web-based questionnaire: a colorful version or a plain version. He compared how introverts vs. extroverts responded to the plain or colorful surveys. He suggested that web surveys in the future might adapt to the personality and characteristics of the respondent to yield the highest completion rates and best quality data. He also noted that the change in question modality between CATI and the web can cause a "cognitive shift": the absence of an interviewer changes how questions are processed, and this may be the basis for so-called modality effects. For some types of surveys, the cognitive shift induced by the web may be unwanted.

Scalable Preference Markets (Ely Dahan, UCLA, Arina Soukhoroukova and Martin Spann, University of Passau): Ely presented a novel way to measure preferences for attribute levels or product concepts in the form of a stock-trading game. Respondents are first trained how to use an internet-based stock trading system. However, rather than trade stocks, the respondents traded product concepts or attribute levels. The price of a “stock” is established based on how each trader thinks the general market (the other participants in the game) will end up valuing each concept or level. Each respondent buys and sells stocks with the goal of winning: having the most total wealth at the end of the game (cash plus value of stocks). Through actual tests of this process, Ely and his colleagues found that respondents enjoyed the stock-trading game much more than taking conjoint surveys. They also found that the resulting preferences from the stock-trading game and conjoint surveys were quite similar. One weakness of the approach is that it provides no ability to estimate individual-level models.

Presenting the Results of Conjoint Analysis (Keith Chrzan, Maritz Research and Jon Pinnell, MarketVision Research): In this two-part session, the presenters demonstrated different approaches to presenting conjoint results. Keith focused on how to present to those unfamiliar with conjoint analysis, and Jon assumed a much more savvy audience. If required to present part worth utilities, both Keith and Jon favored scaling the results to be positive values. While conjoint importances provide an easy way to summarize the data and draw general comparisons between segments, both presenters acknowledged that importances mean little by themselves and can be misleading unless the full context of the range of levels within attributes is appreciated. Presenting market simulation results is more intuitive for those unfamiliar with conjoint analysis. Jon focused on game-analytic approaches to studying possible line extension moves and the possible reactions of competitors. He showed how the results can be prioritized by revenue, share, and profitability.
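
The dependence of importances on the tested level ranges is easy to see from how they are typically computed. Below is a minimal sketch (with hypothetical part worths, not data from the presentations) of the standard range-based importance calculation.

```python
# Minimal sketch of range-based conjoint importances; part worths are hypothetical.
part_worths = {
    "Brand": {"A": 0.40, "B": 0.10, "C": -0.50},
    "Price": {"$10": 0.80, "$15": 0.05, "$20": -0.85},
    "Color": {"Red": 0.15, "Blue": -0.15},
}

# Importance of an attribute = its utility range as a share of the sum of all ranges.
ranges = {attr: max(levels.values()) - min(levels.values())
          for attr, levels in part_worths.items()}
total = sum(ranges.values())

importances = {attr: 100 * r / total for attr, r in ranges.items()}
for attr, imp in importances.items():
    print(f"{attr}: {imp:.1f}%")
# Price "dominates" here only because a wide price range was tested, which is
# exactly why importances can mislead without the level-range context.
```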

Assessing the Integrity of Convergent Cluster Analysis (CCA) Results (John A. Fiedler, Oreon Inc.): Is cluster analysis art or science? That was John’s initial question to the audience. Cluster analysis procedures always find clusters, whether actual delineation exists or not. The question becomes how to quantify the delineation between clusters, and whether such measures can justify the existence of segments. John cited early work by Sawtooth Software founder Rich Johnson on a measure called Cluster Integrity. Intuitively, it is the density of the mid region (between segments) compared to the densities of the regions near the cluster centers. John compared this measure of cluster integrity to the standard measure of reproducibility offered in CCA software (the ability of the algorithm to assign respondents into the same cluster given different starting points). He found little correlation between the measures for synthetic and real data sets. John felt that Cluster Integrity is a useful metric and could be employed when finding cluster solutions. He voiced his desire to see a new version of CCA software that featured both algorithmic and user-interface improvements. And, he concluded that cluster analysis really should be science.
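
The summary describes Cluster Integrity only intuitively, so the sketch below is a rough construction of that intuition rather than Johnson's actual formula: it compares the density of points midway between two cluster centers with the density near the centers themselves, using synthetic data.

```python
# Illustrative sketch only (my own construction, not Johnson's Cluster Integrity formula).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic, well-separated clusters of 2-D points.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
c1, c2 = km.cluster_centers_
midpoint = (c1 + c2) / 2
radius = np.linalg.norm(c1 - c2) / 4      # arbitrary local neighborhood size

def local_density(point):
    # Fraction of all points falling within `radius` of the given point.
    return np.mean(np.linalg.norm(X - point, axis=1) < radius)

integrity_ratio = local_density(midpoint) / np.mean([local_density(c1), local_density(c2)])
print(f"mid-region density / center density: {integrity_ratio:.3f}")
# Values near 0 suggest real separation between segments; values near 1 suggest
# the algorithm simply split one continuous cloud of respondents.
```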

Identification of Segments Determined through Non-Scalar Methodologies (John Pemberton and Jody Powlette, Insight Express): Many researchers are turning to non-scalar methods (such as MaxDiff and ranking tasks) to measure the importance of items and for use in developing segments. However, creating subsequent typing algorithms to assign new respondents into existing segments with these non-scalar techniques is much more challenging than with traditional ratings-based scale techniques (which can use methods such as discriminant analysis). John fielded a comparison study and found that 5-pt Likert scales, ranking tasks, and MaxDiff tasks produced very similar relative weights for the items. John described two methods for finding an efficient subset of pairwise comparisons that could identify segment membership: a discriminant-based approach and a Bayesian updating approach. The Bayesian approach performed slightly better in tests to predict segment membership within holdout samples.
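
The authors' exact typing algorithms are not detailed in this summary; the sketch below shows one simple form Bayesian updating of segment membership could take, with hypothetical segment sizes and pair-preference probabilities.

```python
# Hedged sketch of Bayesian updating for segment typing; priors and
# pair-preference probabilities are hypothetical, not the authors' values.
import numpy as np

segments = ["Seg1", "Seg2", "Seg3"]
prior = np.array([0.4, 0.35, 0.25])          # segment sizes from the original study

# P(respondent prefers first item over second | segment), estimated from the
# original segmentation data, for each pairwise comparison actually asked.
pair_probs = {
    ("quality", "price"): np.array([0.80, 0.30, 0.55]),
    ("brand",   "price"): np.array([0.60, 0.20, 0.70]),
}

def posterior(answers):
    """answers: {pair: True if the first item was preferred, else False}"""
    post = prior.copy()
    for pair, prefers_first in answers.items():
        p = pair_probs[pair]
        post *= p if prefers_first else (1 - p)   # Bayes update per observed answer
    return post / post.sum()

print(posterior({("quality", "price"): True, ("brand", "price"): False}))
```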

Reverse Segmentation: An Alternative Approach (Urszula Jones, Curtis L. Frazier, Christopher Murphy, Millward Brown, & John Wurst, SDR/University of Georgia): One of the classic problems with cluster segmentation research is that segments developed from attitudinal or behavioral variables rarely show many important differences when cross-tabbed against demographic or other targetable characteristics. Of course, clients prefer segments that differ on attitudes and that are also identifiable in the marketplace. Ula presented a method called Reverse Segmentation that helps solve these issues. Rather than clustering individual respondents, respondents are collapsed into objects defined by the demographic or other targetable variables that show significant differences on the attitudinal variables (determined through a series of ANOVA runs). For example, one respondent object may be composed of respondents with college education, low income, male gender, and children. For each object, means are computed on the attitudinal variables, and the objects are then clustered based on those means. Ula presented results for a real dataset showing that the new approach produced segmentation schemes with significant differences on both basis and demographic variables.
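
A minimal sketch of those mechanics, using a synthetic dataset and hypothetical variable names (the ANOVA screening step that selects the demographic variables is omitted here):

```python
# Sketch of Reverse Segmentation mechanics on synthetic data; variable names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "education": rng.choice(["hs", "college"], n),
    "income_band": rng.choice(["low", "mid", "high"], n),
    "has_children": rng.choice([0, 1], n),
    "att_value": rng.normal(size=n),
    "att_quality": rng.normal(size=n),
    "att_brand": rng.normal(size=n),
})
demo_cols = ["education", "income_band", "has_children"]
attitude_cols = ["att_value", "att_quality", "att_brand"]

# 1. Each unique demographic combination becomes one "object".
objects = df.groupby(demo_cols)[attitude_cols].mean().reset_index()

# 2. Cluster the objects on their mean attitudinal profiles.
objects["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0) \
    .fit_predict(objects[attitude_cols])

# 3. Segment membership maps back to respondents through their demographics,
#    so the segments are targetable by construction.
df = df.merge(objects[demo_cols + ["segment"]], on=demo_cols, how="left")
print(df["segment"].value_counts())
```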

Testing for the Optimal Number of Attributes in MaxDiff Questions (Keith Chrzan, Maritz Research and Michael Patterson, Probit Research): MaxDiff is becoming a mainstream procedure for estimating the importance or preference of items. One of the key decisions in designing a study is how many items to show within each set. Keith presented results from three different methodological studies that featured different numbers of items per set. Respondents were randomly divided into groups receiving as few as three to as many as eight items per set, with the total number of sets held constant. Increasing the number of items had a strong effect on the time to complete each set. Three items per set seemed to have lower predictability, but four to eight items per set performed at about parity. Keith highlighted that the extra time needed to ask seven or eight items per task may be put to better use by asking more sets containing four to five items. He concluded that four to five items per set is about the right number to balance respondent difficulty and achieve high precision of estimates. One concern arising from this research is that the parameters often showed statistically significant differences depending on the number of items per set.

Product Line Optimization through Maximum Difference Scaling (Karen Buros, Data Development Worldwide): Karen illustrated the use of multiple variations on TURF searches for developing optimal product lines. The variations included: maximizing the number of customers who like at least one item in the line (reach); maximizing the number of different items liked within the line; maximizing the number who find their favorite item in the line; and maximizing the “share of requirements” (similar to “share of preference”) satisfied by the line. Karen argued that MaxDiff is well suited to TURF-type optimizations, because it discriminates well among items and can lead to strong predictions at the individual level. When the search space is too large, exhaustive search may be infeasible. In those situations, Karen resorted to genetic algorithms (through a modified version of the Sawtooth Software ASM module). She also showed how client-friendly spreadsheet simulators could be delivered, allowing managers to run what-if analyses around optimally designed product lines, with results reported as the percent of respondents reached, and so on.
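
For the simplest of these criteria, reach, the computation is straightforward; the sketch below (with simulated "like" data, not Karen's spreadsheet tool) shows an exhaustive reach search and why genetic algorithms become attractive as the problem grows.

```python
# Exhaustive TURF reach search on a simulated respondents-by-items "like" matrix.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
likes = rng.random((500, 12)) > 0.7     # hypothetical: True where a respondent likes an item
line_size = 4

def reach(items):
    # Share of respondents who like at least one item in the candidate line.
    return likes[:, list(items)].any(axis=1).mean()

best = max(combinations(range(likes.shape[1]), line_size), key=reach)
print(f"best line {best}, reach = {reach(best):.1%}")
# With many items or larger line sizes the number of combinations explodes,
# which is where a genetic-algorithm search becomes necessary.
```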

Agent-Based Simulation for Improved Decision-Making (David G. Bakken, Harris Interactive): Agent-based simulators represent a cutting edge technology that some leading researchers are investigating for market forecasting. David described agent-based simulations as representing complex systems (such as competitive markets) that result from decisions made by a collection of autonomous actors. The actions of these actors are controlled by specific decision rules and influenced by stochastic processes. David showed examples including word-of-mouth, the interaction between consumers and automakers, and the effect of advertising on awareness. The examples were programmed in Microsoft® Excel, but David also described NetLogo, an agent-based simulation toolkit. David concluded that the goal of agent-based simulations should not be a point-estimate prediction, but to achieve a distribution of outcomes that reflect the impact of varying conditions.
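
As a flavor of what such models look like, here is a toy word-of-mouth simulation written in Python rather than Excel or NetLogo; the agent counts, contact rates, and transmission probability are all hypothetical, and the point is simply that repeated runs produce a distribution of outcomes.

```python
# Toy agent-based word-of-mouth model; all parameters are hypothetical.
import random

def simulate(n_agents=500, contacts_per_period=3, p_transmit=0.05,
             periods=20, seed=None):
    rng = random.Random(seed)
    aware = [False] * n_agents
    for i in range(10):                       # seed a handful of initial adopters
        aware[i] = True
    for _ in range(periods):
        newly_aware = []
        for i, a in enumerate(aware):
            if not a:
                # Each unaware agent contacts a few random agents; awareness may transmit.
                contacts = rng.sample(range(n_agents), contacts_per_period)
                if any(aware[c] and rng.random() < p_transmit for c in contacts):
                    newly_aware.append(i)
        for i in newly_aware:
            aware[i] = True
    return sum(aware) / n_agents              # final awareness level

# Run the simulation many times to characterize the spread of outcomes,
# rather than settling for a single point estimate.
outcomes = [simulate(seed=s) for s in range(50)]
print(min(outcomes), sum(outcomes) / len(outcomes), max(outcomes))
```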

How Many Choice Tasks Should We Ask? (Marco Hoogerbrugge and Kees van der Wagt, SKIM Analytical): Marco began by citing earlier work on this same subject by Johnson and Orme, which focused on this question with respect to aggregate models. The key difference in this presentation was the emphasis on individual-level models (as estimated through HB). Marco presented a new way to think about the required number of tasks per respondent. He proposed that for each respondent, we can observe whether the addition of new tasks improves the ability to predict holdout tasks. Once the holdout predictability improves little, then he argued that the results have stabilized (converged) and additional tasks are of little value. Marco and Kees used cluster analysis to summarize the marginal improvements of additional tasks in predicting holdout tasks for relatively homogeneous groups of respondents. They found that for a number of data sets, holdout predictability stabilized for all groups after about the tenth choice task.
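
The decision rule itself is simple; the sketch below uses hypothetical placeholder hit rates (in practice, each value would come from re-estimating HB utilities on the first k tasks and predicting the holdout tasks).

```python
# Sketch of the convergence check: stop adding tasks once the marginal
# improvement in holdout predictability becomes negligible.
# Hit rates below are hypothetical placeholders.
hit_rates = [0.52, 0.58, 0.63, 0.66, 0.68, 0.695, 0.70, 0.702, 0.703, 0.703, 0.704]
threshold = 0.005    # "improves little" cutoff (an assumption)

for k in range(1, len(hit_rates)):
    if hit_rates[k] - hit_rates[k - 1] < threshold:
        print(f"Holdout predictability stabilizes after about {k} tasks")
        break
```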

Sample Planning for CBC Models: Our Experience (Jane Tang, Warren Vandale and Jay Weiner, Ipsos Insight): Jane and Jay shared their experience of being in a company that executes many conjoint-related studies every year. A time-consuming part is discussing sample size during the early planning and bidding of projects. A client may have a general idea about an attribute list and may want early direction regarding rough sample size. To respond to these repeated requests in their organization, they built a spreadsheet tool that uses simple inputs to suggest reasonable sample sizes. They stressed that this tool does not replace the formal tests for design efficiency utilized by marketing scientists; but rather that it is useful for quick and rough initial direction. Jane and Jay extended the formula originally proposed by Johnson for aggregate CBC sample planning. Their extension accounts for percent usage of None, increased model complexity due to many attributes, and the effect of projected homogeneity of the sample. They demonstrated use of the tool with some practical examples.
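
Their spreadsheet is not reproduced here, but the aggregate rule of thumb it extends is commonly stated as n x t x a / c >= 500, where n is the number of respondents, t the tasks per respondent, a the alternatives per task (excluding None), and c the largest number of levels in any attribute (or the largest product of levels for any two-way interaction to be estimated). A minimal sketch:

```python
# Sketch of the commonly cited Johnson rule of thumb for aggregate CBC sample
# planning (not the authors' extended spreadsheet tool): n*t*a/c >= 500.
import math

def min_respondents(tasks, alts_per_task, max_levels, target=500):
    # Smallest n satisfying n * tasks * alts_per_task / max_levels >= target.
    return math.ceil(target * max_levels / (tasks * alts_per_task))

# e.g. 12 tasks, 4 concepts per task, largest attribute has 6 levels
print(min_respondents(tasks=12, alts_per_task=4, max_levels=6))   # -> 63 as a bare minimum
```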

Brand in Context (Andrew Elder, Momentum Research Group): What happens when an established brand in one marketplace seeks to leverage its brand equity to branch out into a new product category not typically associated with that brand? Such is the case with the “triple play,” wherein telecom, cable, and internet providers seek to leverage their brand equity to provide services in related, but somewhat different, spaces. Andy reported the results of a web-based survey that asked respondents about their perceptions and preferences on issues related to triple play, and examined the impact of brand modeling across multiple categories. When modeling adoption of the merged triple-play offering, traditional brand attributes carried weight similar to that in a traditional “single-category” model, demonstrating that core branding components are not strictly bound to their established category. Yet Andy also found that certain attributes related to the category itself had a significant impact on triple-play adoption, suggesting clear contextual effects that affected the ability of certain brands to leverage their reputation into an emerging category. Andy concluded that brand relevance is shaped by perceptions and usage of the category with which the brand is associated, and that category characteristics should be considered an integral part of brand modeling and brand communication.

Brand Positioning Conjoint: A Revised Approach (Curtis L. Frazier, Urszula Jones, & Katie Burdett, Millward Brown): At a previous Sawtooth Software conference, Frazier and co-authors showed how to decompose the brand part worth into image elements through a two-stage approach. In the first stage, individual-level part worths for brand were estimated using a standard CBC approach followed by HB estimation. In the second stage, the part worths for brand were regressed on the ratings of the brands on a variety of image attributes. In this updated presentation, the authors showed how the same type of model could be built with single-stage estimation, by incorporating the brand ratings on the multiple image items within the same design matrix as the CBC tasks. They concluded that the single-stage approach is more robust. The information provided by this model helps clients understand the components of brand strength and see how strengthening the brand on specific image elements affects product choice.

Rethinking (and Remodeling) Customer Satisfaction (Lawrence Katz, IFOP (Paris)): Lawrence suggested that there often is a misunderstanding concerning the goals of satisfaction research. The models typically used reflect a structural analysis of the drivers of satisfaction; while interesting to clients at first, such models are too stable over time to be useful as a tracking tool. One common question is whether to place the overall satisfaction question at the beginning of a battery of image components or at the end. Lawrence suggested that this is more than a simple methodological issue: the two placements measure two distinctly different psychological constructs that he called “surface” and “deep” satisfaction. Whereas the latter is more appropriate for structural modeling using the usual additive models, it is surface satisfaction that best captures what respondents spontaneously think (and say) and the short-term consequences that might result. Surface satisfaction can also be modeled, but by using methods based on open-ended response data and coding schemes that allow for asymmetric attribute effects on satisfaction. Lawrence concluded by suggesting that a satisfaction measurement program should focus on regular studies of surface satisfaction, with relatively long intervals between rounds of deep-satisfaction modeling.

Dual Response “None” Approaches: Theory and Practice (Chris Diener, King Brown Partners, Inc. and Bryan Orme, Sawtooth Software, Inc.): Dual-response None involves asking about the None alternative in a separate question. Respondents first choose among the available products and then separately indicate whether they would actually buy the product they chose. Another way of phrasing the second-stage question is to ask whether respondents would buy any of the products available in the task. These two phrasings reflect different ways both to ask the question and to model the results. Dual-response None provides a safety net when None usage is relatively high, because information about the other attributes and levels is not lost when a None response is recorded. The None parameter is much higher with the dual approach. Dual-response None may be modeled with standard MNL software, with customized approaches (writing the likelihood function as the joint probability across the two questions), and also using Sawtooth Software’s CBC/HB v4; all lead to similar results. Chris also found that modeling the dual-response None as a choice between the chosen alternative and None vs. a choice between all alternatives and None changed the size of the None parameter. Chris suggested asking respondents to consider the None with respect to the chosen alternative, but modeling the results as if all alternatives were being compared to None.
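
The joint-probability idea can be written down compactly; the sketch below is one possible specification (not necessarily the authors' exact one), pairing an MNL stage-one choice with a binary logit buy/no-buy comparison against a None utility.

```python
# One possible joint likelihood for a dual-response None task (an assumption,
# not the paper's exact specification): MNL choice among shown concepts,
# times a binary logit of the chosen concept vs. None.
import numpy as np

def task_likelihood(utilities, none_utility, chosen, would_buy):
    """utilities: vector of concept utilities for one task;
       chosen: index of the concept picked in stage one;
       would_buy: True/False answer to the second-stage question."""
    expu = np.exp(utilities)
    p_choice = expu[chosen] / expu.sum()                        # stage 1: MNL
    p_buy = 1 / (1 + np.exp(none_utility - utilities[chosen]))  # stage 2: chosen vs. None
    return p_choice * (p_buy if would_buy else 1 - p_buy)

print(task_likelihood(np.array([0.5, 1.2, -0.3]), none_utility=0.8,
                      chosen=1, would_buy=True))
```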

“Must Have” Aspects vs. Tradeoff Aspects in Models of Customer Decisions* (Michael Yee, James Orlin, John R. Hauser, MIT, and Ely Dahan, UCLA): Standard conjoint analysis assumes a compensatory model, in which deficiencies in one attribute can be made up for by strengths in other attributes. However, evidence suggests that respondents use non-compensatory strategies to choose product concepts. John described different heuristic rules that respondents might apply to screen products on certain characteristics. For example, a buyer might say: “I will consider flip phones, with mini-keyboards, from Blackberry.” John reviewed a practical method to infer the best lexicographic description of respondents’ (partial) rank data and showed results demonstrating that a non-compensatory model produces hit rates on par with or better than current best-practice compensatory models. The non-compensatory model additionally yields insights for management regarding which aspects respondents screen on. The described method can be used with either traditional card-sort conjoint or choice-based conjoint.

(*Winner of Best Presentation award, based on attendee ballots.)
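
An illustrative sketch (the rule and attribute names are hypothetical, and this is not the authors' estimation method) of how such a "must have" screen acts as a non-compensatory filter before any compensatory evaluation:

```python
# Sketch of a conjunctive "must have" screening rule; concepts failing any
# screen are dropped from consideration regardless of their other strengths.
must_have = {"form": {"flip"}, "keyboard": {"mini"}, "brand": {"Blackberry"}}

def passes_screen(concept):
    return all(concept[attr] in allowed for attr, allowed in must_have.items())

concepts = [
    {"form": "flip",  "keyboard": "mini", "brand": "Blackberry", "price": 199},
    {"form": "candy", "keyboard": "mini", "brand": "Blackberry", "price": 149},
]
considered = [c for c in concepts if passes_screen(c)]
print(considered)   # only the first concept survives the screen
```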

External Effect Adjustments in Conjoint Analysis (Bryan Orme and Rich Johnson, Sawtooth Software, Inc.): The market simulator is the most practical and useful deliverable of a conjoint analysis study. However, due to the assumptions in conjoint analysis, the results usually don’t match actual market shares. Many researchers therefore adjust conjoint models to better predict actual market shares. Bryan showed a method for adjusting for unequal distribution that avoids IIA assumptions and incorporates appropriate differential substitution effects. He also addressed the issue of scale factor and how it relates to random noise in buyer behavior. Bryan argued that adjustments for distribution and scale factor are theoretically defensible, given appropriate data. Beyond that, some researchers additionally tune the models to match observed shares. Bryan showed that the standard Sawtooth Software external effect adjustment doesn’t perform as well as adjustments made to individual-level part worth utilities. Respondent weighting was also tested, but was shown to change the behavior of the simulator in some extreme ways. Bryan emphasized that the use of external effects is a dangerous practice and should be avoided whenever possible. But if a project requires adjustments for forecasting purposes, some adjustments work better than others.
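
The two adjustment styles being contrasted can be illustrated with a small simulation; the utilities, adjustment factors, and utility shift below are all assumed for illustration, not taken from the paper.

```python
# Sketch contrasting a share-level external-effect adjustment with a
# utility-level adjustment; utilities and factors are assumed.
import numpy as np

rng = np.random.default_rng(2)
U = rng.normal(size=(300, 3))                 # respondents x products utilities

def shares(U):
    # Individual-level logit shares, averaged across respondents.
    e = np.exp(U)
    return (e / e.sum(axis=1, keepdims=True)).mean(axis=0)

# 1) Classic external-effect style: multiply simulated shares by factors, renormalize.
factors = np.array([1.3, 0.8, 1.0])
adj = shares(U) * factors
print("share-level adjustment:  ", adj / adj.sum())

# 2) Utility-level adjustment: shift product 0's utility for every respondent
#    (in practice the shift would be tuned until the simulated share hits a target).
U2 = U.copy()
U2[:, 0] += 0.5
print("utility-level adjustment:", shares(U2))
```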

Confound It! That Pesky Little Scale Constant Messes up Our Convenient Assumptions (Jordan Louviere, University of Technology, Sydney and Thomas Eagle, Eagle Analytics, Inc.): Jordan reviewed the issue of scale factor: the size of MNL parameters is inversely related to error. As error in responses increases, the size of the estimated MNL parameters decreases, and vice versa. For that reason, Jordan explained, MNL model parameters cannot be identified unless the scale factor is set to a constant. Without comparable scale factors, it is not appropriate to directly compare part worths from choice studies across respondents. Predictions from simulators can also differ significantly due to scale factor. Scale factor varies between consumers, between questionnaire instruments, and with environmental differences. These issues affect HB and latent class models as well. Jordan described the use of covariance heterogeneity models that capture scale effects as well as mean effects. Using these models, he showed that respondents reflected a distribution of scale factors as well as a distribution of parameter estimates. Jordan argued that failure to pay attention to this (in random coefficient models) can result in biased and misleading inferences and predictions.
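
A small numeric sketch of why scale matters for comparisons and simulations: scaling the same utilities up or down leaves the preference order intact but changes both the apparent parameter magnitudes and the predicted logit shares.

```python
# Sketch of the scale-factor effect: identical preferences, different scale,
# different logit shares and parameter magnitudes.
import numpy as np

def logit_shares(utilities, scale=1.0):
    e = np.exp(scale * np.asarray(utilities))
    return e / e.sum()

u = [1.0, 0.5, 0.0]
print(logit_shares(u, scale=0.5))   # noisier responses -> flatter shares
print(logit_shares(u, scale=2.0))   # less response error -> more extreme shares
```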

The Impact of Incentive Compatibility on Partworth Values (Min Ding, Pennsylvania State University and Joel Huber, Duke University): Most conjoint/choice questionnaires used in practice ask respondents to respond to hypothetical questions. In this presentation, Min and Joel focused on ways to motivate respondents to be more truthful and realistic in their responses. Incentive compatibility, the authors explained, involves motivating respondents to express their true feelings. One way to do this is to reward respondents with a probability of receiving an item that they choose in the preference questionnaire. In a recent JMR article, Ding and coauthors showed how incentive compatibility made conjoint models more predictive of respondents’ actual selections from a Chinese restaurant menu. Respondents in the incentive-compatible condition were also more price sensitive than those completing standard choice questionnaires. In this research, the authors compared three preference measurement tools: choice, dollar matching, and self-explicated models. The incentive-compatible groups showed better holdout predictability and different parameters, including a greater focus on price. Choice and self-explicated models performed about equally well in terms of hit rates, while dollar matching lagged in performance.

The Economic and Psychological Influences of Bundling (Joel Huber, Duke University and Jon Pinnell, MarketVision Research): Joel and Jon reviewed the common reasons for bundling: reducing shoppers’ decision-making burden, increasing total revenue to the firm, and decreasing operating costs (through greater efficiencies). They conducted two research studies involving bundling: one focusing on choices from a fast-food restaurant menu, and the other on choices of vacation packages. Some respondents completed questionnaires that did not involve bundled options (a la carte only), and others saw a mix of bundled and a la carte options. In both studies, the authors did not detect much increase in revenue to the firm due to bundling. But bundling did shift people’s choices: for example, toward more medium drinks and medium fries when those items were offered as part of a bundle.

Estimating Attribute Level Utilities from “Design Your Own Product” Data—Chapter 3 (Jennifer Rice and David G. Bakken, Harris Interactive): Jennifer and David presented a third paper in their series of investigations into Design Your Own Product (DYOP) questions. They discussed how this question type may be more realistic for certain purchase contexts. This final chapter focused on trying to estimate stable parameters at the individual level, and comparing the parameters to those from a standard CBC experiment. To estimate parameters at the individual level with DYOP, they included self-explicated questions on each of the levels in the study. They also employed HB analysis to combine information from the self-explicated questions with the DYOP choices. To estimate price parameters for each item, they included the relative price of each item (with respect to the total configured product’s price) in the design matrix. They achieved reasonably good predictions of holdouts with their model. The price sensitivity parameters differed significantly between CBC and DYOP, and DYOP price sensitivities were often much higher than those from CBC.
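
One concrete detail from the modeling, coding each selected item's price relative to the total configured product price, is easy to illustrate; the items and prices below are hypothetical.

```python
# Sketch of coding relative prices for a configured DYOP product; items and
# prices are hypothetical, not from the authors' study.
items = {"leather_seats": 1200, "sunroof": 800, "nav_system": 1500}
selected = ["leather_seats", "nav_system"]

total_price = sum(items[i] for i in selected)
relative_price = {i: items[i] / total_price for i in selected}
print(relative_price)   # each item's share of the total configured price,
                        # usable as a design-matrix column for that item
```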

Simulating Market Preference with “Build Your Own” Data (Rich Johnson and Bryan Orme, Sawtooth Software, Inc. and Jon Pinnell, MarketVision Research): BYO (Build Your Own) tasks have received some interest over the last 10 years in the literature, and also at this conference. In BYO tasks, respondents design their optimal product by selecting from features offered at stated prices. Rich described an experiment that aimed to measure feature-level price sensitivity by varying the prices shown across respondents. Respondents also completed a CBC questionnaire, so the results could be compared to BYO. BYO data can be analyzed using counts or through MNL by assuming that the respondent made one choice from the universe of all possible product design combinations. But such an MNL model is often impossible to estimate with standard MNL software, as there can be billions of alternatives. Rich showed that counting data give essentially the same answer as the complex MNL. He also showed that simulators can be built using the logs of count probabilities as pseudo utilities. The part worths differed significantly in some cases from CBC, and there were definite context effects in the BYO data. The between-respondents price variations did not lead to stable price sensitivity estimates for BYO data in Rich’s study, and he noted that much larger sample sizes would be needed. Rich suggested that the choice between CBC and BYO should depend on the choice process one wants to model, and that one is not simply a substitute for the other.
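
A hedged sketch of the counting idea (hypothetical counts, one feature only): within each BYO feature, compute the proportion of respondents choosing each level and use the log of those proportions as pseudo part worths.

```python
# Sketch of turning BYO counts into pseudo utilities; counts are hypothetical.
import numpy as np

# Hypothetical counts of BYO level choices for one feature (e.g., screen size).
counts = {"small": 120, "medium": 260, "large": 220}
total = sum(counts.values())

pseudo_utils = {lvl: np.log(n / total) for lvl, n in counts.items()}
print(pseudo_utils)
# These log-probability pseudo utilities can then be summed across features and
# run through a logit share calculation, much like ordinary part worths.
```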