Summary of Findings from the 2007 Sawtooth Software Conference

The thirteenth Sawtooth Software Conference was held in Santa Rosa, California, October 17-19, 2007. The summaries below capture some of the main points of the presentations. We hope that these introductions will help you get the most out of the 2007 Sawtooth Software Conference Proceedings.

The Weakest Link: A Cognitive Approach to Improving Survey Data Quality (David G. Bakken, Harris Interactive): David reminded us that our inferences and theories of consumer behavior are only as good as the data on which they are based. As researchers, we often apply conventional wisdom, “judgment” and some empirical evidence in designing questionnaires. But, often in our haste to take studies to field, we fail to pretest and refine our instruments. David reviewed previous work by psychologists regarding how humans interact with surveys. The four-step model of survey response involves comprehension, retrieval, judgment, and response. He advocated the use of “Think Aloud Pre-Testing” in which respondents (10-20 per wave) verbalize their thoughts while answering survey questions. These tests should be conducted over multiple days to allow survey changes to be implemented and re-tested. Based on many such tests, David offered some observations regarding how respondents interact with web-based surveys and how they can be improved. Current problem areas include grid questions, survey navigation, error messages, multi-lingual surveys, and CBC questionnaires.

Evaluating Financial Deals Using a Holistic Decision Modeling Approach (Paul Venditti, Don Peterson, and Matthew Siegel, General Electric): Paul described a very interesting approach that he and his co-authors are implementing within GE to evaluate complex financial deals. In the past, analysts have spent many hours evaluating financial deals and presenting the details of those deals to a committee of three individuals. Paul described how the characteristics of those deals could be defined using about 20 “conjoint” attributes. A modified ACA survey was developed to study the three key individuals at GE who approve deals. The standard stated importance question in ACA was replaced with a constant-sum question implemented via an Excel worksheet. The final part-worth utilities were further modified by implementing a few non-compensatory rules (red flags). A market simulator based on the three respondents was found to be highly predictive of whether deals were approved or rejected in the months following the surveys (accuracy of about 80%). Paul’s work demonstrated that effective conjoint models for profiling very small populations can be built with very small sample sizes. Conjoint analysis can provide good data for implementing sophisticated decision support tools in non-traditional contexts.

Issues and Cases in User Research for Technology Firms (Edwin Love, University of Washington School of Business, and Christopher N. Chapman, Microsoft Corporation): Edwin and Christopher described how conducting market research for technology products presents unique challenges. For example, innovative features are often not well understood by respondents, and different user groups will have different levels of understanding. Also, some features might not yet exist when the research is being conducted. The presenters commented that vague descriptions of attributes such as “easy setup” can skew user responses toward expressing strong preference for nondescript features, creating the illusion of specific value where none may exist. They further recommended segmenting respondents based on product experience: owners vs. intenders. Edwin and Christopher illustrated these challenges via three case studies: a digital pen project, a webcam, and a digital camera.

Minimizing Promises and Fears: Defining the Decision Space for Conjoint Research for Employees versus Customers (L. Allen Slade, Covenant College): Conjoint analysis can be a valuable tool in both consumer and employee research. However, the researcher must recognize the key differences in how the firm interacts with the respondents. Allen affirmed that customers are less interdependent with the firm than are employees, and that some employees (depending on role and experience/training) are more highly interdependent with the firm than others. With employee research, the worry is about creating false promises of rewards or unwarranted fears of takeaways. Allen suggested that researchers ask themselves three key questions prior to including something in a conjoint survey for employees: 1) Would we be willing to actually do this?, 2) How does this intervention compare to the others we are considering?, and 3) How would an employee or customer react to taking this survey? Using an actual case study at Microsoft (total rewards optimization), Allen illustrated how applying these three questions led to effective research without undue promises or fears.

A Cart-Before-the-Horse Approach to Conjoint Analysis* (Ely Dahan, UCLA Anderson School): With traditional conjoint studies, respondents are often asked to complete long surveys, they are required to rate products they don’t like, and the resulting part-worth utilities often contain reversals. Ely described a novel, computer-administered, adaptive method of employing a traditional full-profile conjoint design. Rather than estimate part-worth utilities after respondents take the surveys, CARDS (conjoint adaptive ranking database system) begins with a researcher-constructed database of typically thousands of potential sets of consistent part-worth utilities. Respondents are shown a set of product concepts and asked to choose which products they prefer. After the respondent provides a few answers, the database of utilities is queried to determine whether certain product concepts that haven’t yet been evaluated are clearly inferior (and thus should not be the respondent’s next choice). Those products are deleted from the screen, allowing respondents to focus on the product concepts that are relevant to identifying which set of utilities best fits them, while forcing respondents to maintain consistent ordering. The benefit is much shorter questionnaires. The downsides are that early answers matter a lot and there is no real error theory. In addition, the quality of the results depends on how well researchers can develop the database of potential sets of utilities.
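
A minimal sketch of the core CARDS logic may help make the mechanics concrete. The data structures and helper names below are illustrative assumptions, not Ely's implementation: a database of candidate part-worth sets is filtered after each answer, and concepts that no surviving candidate would rank first are pruned from the screen.

```python
import numpy as np

# Illustrative sketch of the CARDS idea: filter a database of candidate
# part-worth vectors as the respondent chooses, then drop concepts that are
# dominated under every remaining candidate. Names/sizes are assumptions.

rng = np.random.default_rng(7)
n_candidates, n_concepts, n_params = 5000, 8, 6

# Researcher-constructed database of internally consistent part-worth sets.
candidate_utils = rng.normal(size=(n_candidates, n_params))
# Design matrix: one row of attribute codes per product concept.
concepts = rng.integers(0, 2, size=(n_concepts, n_params)).astype(float)

def record_choice(candidates, chosen, remaining):
    """Keep only candidate part-worth sets that agree the chosen concept
    beats every other concept still on the screen."""
    values = candidates @ concepts.T                  # candidate x concept
    best_ok = values[:, chosen][:, None] >= values[:, remaining]
    return candidates[best_ok.all(axis=1)]

def prune_inferior(candidates, remaining):
    """Drop concepts that no surviving candidate set would ever rank first."""
    values = candidates @ concepts[remaining].T
    ever_best = np.zeros(len(remaining), dtype=bool)
    ever_best[np.unique(values.argmax(axis=1))] = True
    return [c for c, keep in zip(remaining, ever_best) if keep]

remaining = list(range(n_concepts))
chosen = 3                                            # respondent's first pick
remaining.remove(chosen)
candidate_utils = record_choice(candidate_utils, chosen, remaining)
remaining = prune_inferior(candidate_utils, remaining)
print(len(candidate_utils), "candidate utility sets and", len(remaining), "concepts remain")
```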

(*Winner of Best Presentation award, based on attendee ballots.)

Two-Stage Models: Identifying Non-Compensatory Heuristics for the Consideration Set then Adaptive Polyhedral Methods within the Consideration Set (Steven Gaskin, AMS, Theodoros Evgeniou, INSEAD, Daniel Bailiff, AMS, and John Hauser, MIT): Steven reviewed the scientific evidence suggesting that people buy products by first forming a consideration set and then choosing a product from within that set. This two-stage approach helps people deal with the large number of alternatives in the choices they face. By reflecting this process in our choice models, Steven argued, we can more accurately model choices, create more realistic and enjoyable surveys, and handle more features than conventional CBC. He presented a survey design in which respondents may use non-compensatory rules (cut-offs) to form consideration sets. Respondents are then asked to trade off considered products within a more standard-looking CBC task. He and his co-authors employed FastPace CBC to estimate the utilities for the n most important compensatory features for each respondent. Steven reported results showing that respondents preferred the adaptive survey over standard CBC.
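
The two-stage logic can be illustrated with a small simulation sketch. Everything below (attribute names, utilities, and the logit choice rule within the consideration set) is a hypothetical construction for illustration, not the authors' estimation code.

```python
import numpy as np

# Illustrative two-stage choice simulation: stage 1 applies non-compensatory
# cut-off rules to form a consideration set; stage 2 applies a compensatory
# logit over part-worth utilities within that set.

def two_stage_shares(products, partworths, must_have=None, must_avoid=None):
    """products: dict name -> dict of attribute levels
       partworths: dict (attribute, level) -> utility
       must_have / must_avoid: dicts attribute -> required / unacceptable level."""
    must_have = must_have or {}
    must_avoid = must_avoid or {}

    def considered(p):
        ok_have = all(p.get(a) == lvl for a, lvl in must_have.items())
        ok_avoid = all(p.get(a) != lvl for a, lvl in must_avoid.items())
        return ok_have and ok_avoid

    names = [n for n, p in products.items() if considered(p)]
    if not names:
        return {}
    utils = np.array([sum(partworths.get((a, lvl), 0.0) for a, lvl in products[n].items())
                      for n in names])
    expu = np.exp(utils - utils.max())                # stable softmax
    return dict(zip(names, expu / expu.sum()))

products = {
    "A": {"brand": "X", "price": "high", "screen": "large"},
    "B": {"brand": "Y", "price": "low", "screen": "small"},
    "C": {"brand": "X", "price": "low", "screen": "large"},
}
partworths = {("brand", "X"): 0.4, ("brand", "Y"): 0.1,
              ("price", "low"): 0.8, ("price", "high"): -0.8,
              ("screen", "large"): 0.3, ("screen", "small"): -0.3}
print(two_stage_shares(products, partworths, must_avoid={"price": "high"}))
```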

A New Approach to Adaptive CBC (Rich Johnson and Bryan Orme, Sawtooth Software): Existing CBC questionnaires have weaknesses: they are viewed as tedious and not very focused on the particular needs of each respondent. The experimental plans have assumed compensatory behavior, yet previous research has shown that many respondents apply non-compensatory heuristics to answer conjoint questionnaires. Rich and Bryan presented a new technique for adaptive CBC that helps overcome these issues. Their approach mimics the purchase process of formulating a consideration set using non-compensatory heuristics (such as “must have” or “must avoid” features), followed by a more careful tradeoff of alternatives within the consideration set using compensatory rules. This new approach involves three core stages: 1) Build-Your-Own (BYO) Stage, 2) Screening Stage, and 3) Choice Tasks Stage. They conducted a split-sample experiment comparing the new approach to traditional CBC. They found that respondents liked the adaptive survey more and felt it was more realistic, even though it took about twice as long as traditional CBC. Furthermore, part-worths developed from ACBC were more predictive of holdout tasks than those from traditional CBC, despite a methodological bias in favor of CBC for predicting the CBC-looking holdouts.

HB-Analysis for Multi-Format Adaptive CBC (Thomas Otter, Goethe University): The three-stage interview proposed by Johnson and Orme is innovative, but formulating a model that extracts the common preference information is a challenge. Thomas first showed that such a model is required: simply discarding any of the data collected before the CBC part results in inconsistent inferences in an HB setting. Thomas then investigated different models: a multinomial likelihood for all parts of the interview allowing for task-specific scale factors; task-specific “wiggles” in the preference vector using the same likelihood; a binary logit likelihood for the screener part; and a multichoice likelihood for that same part. Thomas found that the scale factor did vary considerably between the sections. However, accounting for task-specific scales had only a small effect on the predictive ability of the models. Moreover, his results suggest that a binary logit or a multichoice likelihood for the screener part of the interview is preferable to the explosion into multinomial choices, both in terms of the implied story about how the data are generated and the empirical fits.

EM CBC: A New Framework for Deriving Individual Conjoint Utilities by Estimating Responses to Unobserved Tasks via Expectation-Maximization (Kevin Lattery, Maritz Research): Kevin demonstrated how EM algorithms can be used to estimate individual-level utilities from CBC data. EM is often applied in missing-values analysis. In the context of CBC, each respondent can be viewed as having been shown all the tasks in a very large design plan, but having completed only a subset of them. The missing answers are imputed via EM. Once missing answers have been imputed, there is enough information available to estimate part-worth utilities for each individual. Utility constraints may be implemented as well. Kevin faced a few challenges in implementing EM for CBC. He found that if he allowed EM to iterate fully to convergence, overfitting would occur; therefore, he relaxed the convergence criterion. Kevin also found that the estimated probabilities for the tasks respondents actually completed and for those that were missing differed in their means and standard deviations, so he adjusted the results from each task so that the means and variances of the missing data were comparable to the observed data. He then repeated the EM process until the missing data converged. Kevin compared utilities estimated under EM to those estimated via HB, and found that the EM utilities performed as well as or better than HB utilities for three data sets.
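
The following is a highly simplified, single-respondent sketch of the E/M alternation described above, using simulated data. Kevin's actual procedure pools information across respondents, relaxes the convergence criterion, and rescales the imputed tasks' means and variances; none of that is shown here.

```python
import numpy as np

# Simplified EM sketch for CBC: treat unseen tasks from a large design plan as
# missing, impute choice probabilities for them from the current part-worth
# estimate (E-step), then re-fit the part-worths on observed plus imputed
# responses via a weighted MNL gradient step (M-step).

rng = np.random.default_rng(1)
n_tasks, n_alts, n_params = 20, 3, 5

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

X = rng.normal(size=(n_tasks, n_alts, n_params))         # full design plan
observed = np.zeros(n_tasks, dtype=bool)
observed[:8] = True                                       # respondent answered 8 tasks
true_beta = rng.normal(size=n_params)
obs_idx = np.where(observed)[0]
answers = np.array([rng.choice(n_alts, p=softmax(X[t] @ true_beta)) for t in obs_idx])

resp = np.zeros((n_tasks, n_alts))                        # choice "responsibilities"
resp[obs_idx, answers] = 1.0                              # one-hot for observed tasks

beta = np.zeros(n_params)
for _ in range(30):                                       # outer EM iterations
    for t in np.where(~observed)[0]:                      # E-step: impute missing tasks
        resp[t] = softmax(X[t] @ beta)
    for _ in range(20):                                    # M-step: gradient steps on the
        grad = np.zeros(n_params)                          # weighted MNL log-likelihood
        for t in range(n_tasks):
            grad += X[t].T @ (resp[t] - softmax(X[t] @ beta))
        beta += 0.05 * grad

print("true part-worths:     ", np.round(true_beta, 2))
print("estimated part-worths:", np.round(beta, 2))
```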

Removing the Scale Factor Confound in Multinomial Logit Choice Models to Obtain Better Estimates of Preference (Jay Magidson, Statistical Innovations, and Jeroen K. Vermunt, Tilburg University): Jay reintroduced the audience to the issue of the scale factor. The size of the parameters in MNL estimation is inversely related to the amount of certainty in respondents’ choices. Because different groups of respondents may have different scale factors, it is not theoretically appropriate to directly compare the raw MNL estimates between groups. Jay showed how such comparisons can lead to incorrect conclusions. He then turned attention toward an extended Latent Class choice model that isolates the scale parameter. Using that model, he showed how latent class segmentations for real data sets can differ from those of the generic latent class model that doesn’t separately model scale. In one comparison, Jay found that the amount of time respondents spent answering a CBC questionnaire was directly related to segment membership from standard latent class estimation (without estimating the scale factor). Jay also demonstrated how scale estimation can be incorporated into DFactor Latent Class models. Jay concluded that removing the scale confound in latent class modeling will result in improved estimates of part-worths and improved targeting of relevant segments, based on a better understanding of segment preferences and levels of uncertainty.
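
To make the confound concrete, the scaled multinomial logit can be written as follows (the notation is ours, not necessarily Jay's). Because only the product of scale and preference is identified within a group, two groups with identical preferences but different response certainty will show different raw coefficients.

```latex
% lambda = scale factor, beta = preference weights, S = choice set.
% Only the product lambda * beta is identified from choices within a group.
P(i \mid S) \;=\; \frac{\exp\{\lambda\, x_i'\beta\}}{\sum_{j \in S} \exp\{\lambda\, x_j'\beta\}}
```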

An Empirical Test of Alternative Brand Measurement Systems (Keith Chrzan and Doug Malcom, Maritz Research): Keith and Doug presented results from three commercial studies that compared different ways of collecting brand image data. Those methods included Likert ratings, comparative ratings, MaxDiff, pick any, semantic differential, and yes/no scaling. They argued that a brand image measurement system should produce 1) credible brand positions (face validity), 2) strong differences among brands (discriminant validity), and 3) powerful predictions of brand choice (predictive validity). The first two studies demonstrated that Likert ratings and pick any data were generally inferior to the other methods. The third study compared semantic differential, comparative ratings, yes/no, and pick any data. They concluded that, of those four methods, comparative ratings had the most discriminating power, followed by semantic differential. Pick any data measured little beyond the halo effect (a complicating issue wherein brands/objects that are liked overall tend to get higher ratings across the board on the attributes). To help control for the halo effect, the authors double-centered the scores prior to making comparisons.
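
Double-centering is straightforward to illustrate. The small matrix below is made-up data; the operation simply removes row (brand) and column (attribute) means so that what remains reflects relative brand-attribute positioning rather than overall liking.

```python
import numpy as np

# Illustrative double-centering of a brands x attributes score matrix:
# removing brand and attribute means strips the overall-liking "halo"
# component from the scores.
scores = np.array([[7.1, 6.8, 7.4],      # rows: brands, columns: image attributes
                   [5.2, 5.9, 5.0],
                   [6.0, 6.1, 6.5]])

double_centered = (scores
                   - scores.mean(axis=1, keepdims=True)   # remove brand (halo) means
                   - scores.mean(axis=0, keepdims=True)   # remove attribute means
                   + scores.mean())                        # add back the grand mean
print(np.round(double_centered, 2))
```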

Alternative Approaches to MaxDiff with Large Sets of Disparate Items–Augmented and Tailored MaxDiff (Phil Hendrix, immr, and Stuart Drucker, Drucker Analytics): Phil and Stuart investigated some enhancements to standard MaxDiff questionnaires to help deal with large numbers of items while still achieving strong individual-level scores. The authors argued that with more than about 40 items, MaxDiff becomes very tedious for respondents if individual-level estimates are required. To deal with this issue, the authors proposed that respondents first perform a Q-Sort task, wherein they drag and drop items into one of K buckets (they used 4 buckets in their research). The information from the Q-Sort task can be added to the MaxDiff information to improve the estimates. The Q-Sort task can also be used to create customized MaxDiff questions that principally draw on the items of greatest preference/importance. Phil and Stuart conducted a split-sample study comparing standard MaxDiff with two forms of augmented MaxDiff exercises. They found that the aggregate parameters were very similar across the methods, but both forms of augmented MaxDiff outperformed ordinary MaxDiff in terms of holdout predictions. Respondents also found the Q-Sort + MaxDiff methodology more enjoyable than standard MaxDiff alone.
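
One plausible way to code the Q-Sort augmentation (an assumption on our part, not necessarily the authors' exact coding) is to treat any item sorted into a more-preferred bucket as an implied best/worst win over any item in a less-preferred bucket, and to append those implied comparisons to the respondent's MaxDiff data before estimation.

```python
from itertools import product

# Hypothetical coding: convert Q-Sort bucket assignments into implied
# (winner, loser) pseudo-observations that augment the MaxDiff choice data.

buckets = {                       # bucket 1 = most preferred ... 4 = least preferred
    1: ["item_03", "item_11"],
    2: ["item_07"],
    3: ["item_01", "item_05"],
    4: ["item_09"],
}

implied_pairs = []                # (winner, loser) pseudo-observations
bucket_ids = sorted(buckets)
for hi in bucket_ids:
    for lo in bucket_ids:
        if hi < lo:               # lower bucket number = stronger preference
            implied_pairs.extend(product(buckets[hi], buckets[lo]))

print(len(implied_pairs), "implied comparisons, e.g.", implied_pairs[:3])
```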

Product Optimization as a Basis for Segmentation (Chris Diener, Lieberman Research Worldwide): Chris motivated his presentation by reviewing the strategic goals and outcomes of traditional segmentation approaches. With attitudinal segmentations, one finds strong segments in terms of attitudinal differences, but those differences often do not translate into segments that differ strongly in terms of product preferences. With segmentation based on product features, the hope is that the segments have targetable differences and that the preferences translate into profitable product line decisions. If product optimization is used as the focus, then there is a stronger linkage with profitable product line decisions. Among optimization methods, Chris stated that he prefers genetic algorithms. However, Chris pointed out that segmentation based on product optimization provides no guarantee that the segments will demonstrate targetable differences in terms of attitudes, media usage, or demographics. To improve the odds that the segments are useful, Chris advocated data fusion processes that combine information from attitudinal segmentation and product optimization segmentation, especially when the strategic priority is product development and the researcher is confident an attitudinal story can be found.

Joint Segmenting Consumers Using both Behavioral and Attitudinal Data (Luiz Sa Lucas, IDS Market Analysis): Luiz discussed segmentation methods that incorporate both behavioral and attitudinal data. Behavioral data alone are often not satisfactory to use in segmentation schemes, because the segments do not necessarily map to anything useful in terms of descriptive demographics or attitudinal data. By the same token, attitudinal data alone are not sufficient because attitudes don’t necessarily correlate strongly with behaviors. Luiz reviewed multiple procedures for incorporating both behavioral and attitudinal data in segmentation, including Reverse Segmentation, Weighted Distance Matrices, Concomitant Variables Mixture Models, Joint Segmentation, and LTA models. Luiz finished by discussing different fit metrics for determining the appropriate number of clusters.

Defining the Linkages between Cultural Icons (Patrick Moriarty, OTX and Scott Porter, 12 Americans): Patrick and Scott described a mapping methodology in which cultural icons (celebrities, brands, politicians) are placed within a perceptual map. The data are in part driven by a MaxDiff questionnaire. The goal is to provide a unique understanding of the strength of linkage between brands, personalities, and media properties based on consumer attraction. Their research identified that religion and marital status are the two social identities that on average most define individuals. But, identity may also be measured by the degree to which people express connection with cultural icons. The authors explained that cultural icons can also be measured and characterized, in terms of four key components: Recognition, Attraction, Presence, and Polarization. As an example of how their mapping methods can drive strategy, they showed relationships between either Hillary Clinton or Rudy Giuliani, segments of the population, and popular consumer brands.

Cluster Ensemble Analysis and Graphical Depiction of Cluster Partitions (Joseph Retzer and Ming Shan, Maritz Research): Joe described a relatively new unsupervised learning technique called Cluster Ensemble Analysis, which has been suggested as a generic approach for improving the accuracy and stability of cluster algorithm results. Cluster ensembles begin by generating multiple cluster solutions using a “base learner” algorithm, such as K-means. Multiple solutions may be generated in a variety of ways. The basic idea is to combine the results of a variety of cluster solutions to find a consensus solution that is representative of the different solutions. Joe further demonstrated how the quality of cluster solutions can be graphically depicted using silhouette plots. The silhouette shows which objects lie well within their cluster and which fall somewhere between clusters. He finished by showing how cluster ensemble analysis can improve cluster results for a particularly difficult sample data set with non-spherical clusters.
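
A minimal sketch of the ensemble idea follows; the base learner, number of runs, and consensus function are illustrative choices rather than Joe's specific implementation. It builds a co-association matrix from repeated K-means runs, extracts a consensus partition, and scores it with a silhouette measure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Sketch of a cluster ensemble: run the K-means "base learner" many times with
# different seeds and cluster counts, build a co-association matrix (how often
# each pair of objects lands in the same cluster), and extract a consensus
# partition by average-linkage clustering of 1 - co-association.

X, _ = make_moons(n_samples=200, noise=0.06, random_state=0)   # non-spherical clusters
n = len(X)
co_assoc = np.zeros((n, n))

n_runs = 50
for run in range(n_runs):
    k = np.random.RandomState(run).randint(2, 8)
    labels = KMeans(n_clusters=k, n_init=5, random_state=run).fit_predict(X)
    co_assoc += (labels[:, None] == labels[None, :])
co_assoc /= n_runs

distance = 1.0 - co_assoc
np.fill_diagonal(distance, 0.0)
consensus = fcluster(linkage(squareform(distance), method="average"),
                     t=2, criterion="maxclust")

print("consensus silhouette:",
      round(silhouette_score(distance, consensus, metric="precomputed"), 3))
```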

Modeling Health Service Preferences Using Discrete Choice Conjoint Experiments: The Influence of Informant Mood (Charles Cunningham, Heather Rimas, and Ken Deal, McMaster University): Chuck presented the results of a research study that investigated how depression influences performance on discrete choice experiments designed to understand patient preferences. Previous evidence in the literature suggests that people with depressive disorders can have impaired information processing and a host of related decision-making deficits. Because Chuck and his co-authors often use discrete choice experiments for health care planning issues, and because the incidence of depression is relatively high within the populations they survey, these issues were of interest to them. They found that although depression did not increase inconsistent responding to identical holdout tasks (test-retest reliability), it did influence health service preferences and segment membership. Chuck also reviewed basic principles for designing and analyzing holdout questions.

Determining Product Line Pricing by Combining Choice Based Conjoint and Automated Optimization Algorithms: A Case Example (Michael Mulhern, Mulhern Consulting): Mike presented the results of a recent study whose purpose was to develop an optimal pricing strategy for a product line decision. Six price levels were included in the study, and based on the plot of average utilities, there appeared to be two “elbows” in the price function. The elbows seemed to represent optimal price points for a mid-price and a higher-price product. Mike used the Advanced Simulation Module to conduct optimization searches to maximize revenue. He found that the optimization routines also identified those same two price points as optimal positions. The different optimization algorithms (exhaustive, grid, gradient, stochastic, and genetic) produced identical results irrespective of the starting points (with the exception of the gradient search method, which had some inconsistencies). Mike’s client also asked whether the optimal price points would change under different assumptions for the base case. Altering the base case and re-running the optimizations revealed similar recommendations in most cases. Mike was able to report what the client eventually did and how actual sales volume compared to the simulation’s predictions. The client followed some of the recommendations but ignored others. The sales results suggest that ignoring the recommendations provided by the optimization simulations was costly: a poorly positioned mid-price product foundered, as the model would have predicted.
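
A toy sketch of the exhaustive-search idea is shown below. The price levels, part-worths, and competitive set are invented, and the code is only a stand-in for the kind of revenue search an optimizer performs over researcher-specified level ranges, not the Advanced Simulation Module itself.

```python
import numpy as np
from itertools import product

# Toy exhaustive search for revenue-maximizing price points for a two-product
# line against one fixed competitor, using a simple logit share simulator.
# All utilities and prices are made up for illustration.

price_levels = [99, 129, 149, 179, 199, 249]
price_util = dict(zip(price_levels, [1.2, 0.8, 0.5, 0.1, -0.3, -1.0]))  # avg part-worths
base_util = {"mid": 0.6, "premium": 1.1, "competitor": 0.9}
competitor_total = base_util["competitor"] + price_util[149]

best = None
for p_mid, p_prem in product(price_levels, repeat=2):
    if p_prem <= p_mid:                       # keep the line ordered: premium costs more
        continue
    utils = np.array([base_util["mid"] + price_util[p_mid],
                      base_util["premium"] + price_util[p_prem],
                      competitor_total])
    shares = np.exp(utils) / np.exp(utils).sum()
    revenue = shares[0] * p_mid + shares[1] * p_prem
    if best is None or revenue > best[0]:
        best = (revenue, p_mid, p_prem)

print("best revenue %.1f at mid=%d, premium=%d" % best)
```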

Using Constant Sum Questions to Forecast Sales of New Frequently Purchased Products (Greg Rogers, Procter & Gamble): Greg compared two relatively common methods for measuring buyer intent for an FMCG category: CBC and constant sum allocations (both computer-administered). Not surprisingly, the constant sum allocation (out of 10 purchases) data were more “spiked” at the 0%, 10%, 50%, and 100% allocation probabilities relative to the probabilities projected from the pick-one CBC data. Greg expanded the analysis to include a Dirichlet model (to estimate base trial for a new item) that incorporated the issues of trial and frequency. Greg concluded that analyzing the brand choices from simple constant sum scales using a Dirichlet model results in base trial estimates comparable to those derived from CBC. This finding has implications for researchers who cannot use other methods such as purchase intent (which requires a database to interpret) or CBC (which can be relatively complex and costly) to estimate trial for new products.

Replacement Modeling: A Simple Solution to the Challenge of Measuring Adding and Switching in a Polytherapy Choice Allocation Model (Larry Goldberger, Adelphi Research by Design): In pharmaceutical research, doctors sometimes prescribe multiple drugs to treat a particular condition. When this occurs, the standard allocation models’ assumption that each patient is assigned a single drug therapy is violated: the allocations may sum to more than 100%, so the allocation total is no longer fixed. Larry demonstrated a Polytherapy Allocation Model that does not assume that the total sum allocated per task is 100%. The proposed solution models the likelihood that a new product will substitute for an existing product, and does not constrain the sum to 100%. Larry also reviewed other common approaches to the problem and discussed their limitations. He discussed the common binary logit approach and how its cross-effects can often lead to reversals.

Data Fusion to Support Product Development in the Subscriber Service Business (Frank Berkers, Gerard Loosschilder, SKIM Group, and Mary Anne Cronk, Philips Lifeline Systems): Data fusion can involve combining different datasets to learn more than the original datasets had to offer individually. The authors explained how they used data fusion to help develop new strategies with respect to a subscriber service for Lifeline monitoring (the leader in North America for Personal Emergency Response Systems). Specifically, the authors were able to develop a plan of action to approach customers with increased communication regarding specific offers depending on the pattern of signals received from the subscriber. This provided an “early warning system” that would flag subscribers as in danger of deactivating their service. By implementing this system, subscriptions could be prolonged, resulting in greater profitability to the firm. The combination of behavioral patterns and background characteristics gave a better and clearer warning of imminent deactivation, and the type of deactivation, than the separate data sources could provide. Furthermore, the combined information provided greater clarity in deciding what services to offer, and when to offer them to subscribers.

Multiple Imputation as a Benchmark for Comparison within Models of Customer Satisfaction (Jorge Alejandro and Kurt Pflughoeft, Market Probe): Kurt emphasized that many studies must deal with missing data, and the degree of missingness can be significant. Different missing value routines will lead to different degrees of bias and imprecision in statistical estimates. The authors examined a variety of techniques to deal with or impute missing data: casewise and pairwise deletion, the Missing Indicator Method, mean substitution, regression-based imputation, Expectation Maximization (EM), and Multiple Imputation. They used a real customer satisfaction dataset for a bank and induced missingness by deleting values; they then estimated regression models and compared the results to the same models fit before the data were made missing. They determined that Multiple Imputation appeared to be the best performer in terms of reducing bias, and it was generally more realistic in terms of standard errors. The Missing Indicator Method and mean substitution were generally biased, as the authors expected. Point estimates from EM worked well with regression; however, the SPSS-imputed dataset was biased. Pairwise deletion performed well in this experiment in estimating stable beta coefficients.
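
A minimal sketch of the multiple imputation workflow (not the authors' setup) is shown below: impute the predictors several times with a stochastic imputer, fit the regression on each completed dataset, and pool. A fuller treatment would typically include the dependent variable in the imputation model and combine within-imputation variances per Rubin's rules.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Sketch of multiple imputation for a regression model: m stochastic
# imputations of the predictor matrix, one regression fit per completed
# dataset, then pooled point estimates and between-imputation spread.

rng = np.random.default_rng(0)
n, p = 400, 3
X_full = rng.normal(size=(n, p))
beta_true = np.array([0.8, -0.5, 0.3])
y = X_full @ beta_true + rng.normal(scale=0.5, size=n)

X_missing = X_full.copy()
X_missing[rng.random(size=(n, p)) < 0.25] = np.nan      # ~25% values missing at random

m = 10
betas = []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    X_imp = imputer.fit_transform(X_missing)
    Xd = np.column_stack([np.ones(n), X_imp])            # add intercept
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    betas.append(b)

betas = np.array(betas)
print("pooled coefficients:   ", np.round(betas.mean(axis=0), 3))
print("between-imputation SD: ", np.round(betas.std(axis=0, ddof=1), 3))
```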

Making MaxDiff More Informative: Statistical Data Fusion by way of Latent Variable Modeling (Lynd Bacon, YouGov/Polimetrix, Inc., Peter Lenk, University of Michigan, Katya Seryakova and Ellen Veccia, Knowledge Networks): Lynd demonstrated three different ways to think about coding and estimating MaxDiff data: differences coding, coding as two separate choice tasks, and rank-imputed exploding logit. All three methods produced very similar results. The authors then turned their attention to a weakness in MaxDiff experiments: the scores are scaled with respect to an arbitrary intercept (rather than a common origin) for each respondent. This makes it hard to compare a single score from one respondent to a single score from another. They applied a different model (a cutpoint model for ratings) that allows them to estimate the scores for items on a common scale with a common origin. They demonstrated how the new model can improve researchers’ ability to identify respondents to target according to overall preference for a feature. They also emphasized that the lack of a common scale origin extends to attributes within standard discrete choice methods, and the new model can be applied in those situations as well.
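
The "two separate choice tasks" coding is easy to sketch for a single MaxDiff question. The item indices and answers below are invented, and implementation details (for example, whether the best item is dropped from the worst task) vary by practitioner.

```python
import numpy as np

# Sketch of coding one MaxDiff question as two choice tasks: the best choice
# is a standard multinomial choice among the displayed items, and the worst
# choice is a second task whose design matrix is the negative of the first,
# so that low-utility items become likely "choices."

n_items = 8
shown = [2, 5, 0, 7]            # item indices displayed in this MaxDiff task
best, worst = 5, 7              # respondent's answers

design_best = np.zeros((len(shown), n_items))
design_best[np.arange(len(shown)), shown] = 1.0         # dummy-coded items
chosen_best = shown.index(best)

design_worst = -design_best                              # negated design for "worst"
chosen_worst = shown.index(worst)

# Each (design, chosen row) pair can then be stacked across tasks and
# respondents and fed to any MNL / HB estimation routine.
print(design_best, chosen_best, chosen_worst, sep="\n")
```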

Endogeneity Bias—Fact or Fiction? (Qing Liu, University of Wisconsin, Thomas Otter, Goethe University, and Greg Allenby, Ohio State University): In theoretically proper applications of regression modeling, the independent variables are truly independent. However, in some market research applications, the independent variables are not truly independent. Examples include sequential analysis, time series models with lagged dependent variables, and Adaptive Conjoint Analysis (ACA). Greg suggested that endogeneity bias will matter whenever an adaptive procedure is used to learn about respondents (so that informative questions can be determined) and these data are excluded from analysis. However, with ACA, all of the information from each respondent is included in the estimation. Whether endogeneity bias arises depends only on whether one relies on the likelihood principle, and therefore, Greg explained, “being Bayes” or not matters. The presence of endogenously determined designs in ACA doesn’t affect the likelihood of the data. Although some bias is introduced in ACA due to endogeneity, it is typically quite small and ignorable.

CBC/HB, Bayesm and other Alternatives for Bayesian Analysis of Trade-off Data (Well Howell, Harris Interactive): HB has become a mainstream tool for analyzing results of DCM and related techniques (such as MaxDiff). There are a number of tools available for HB estimation, including Sawtooth Software’s CBC/HB product, bayesm (R package), WinBUGS, and Harris Interactive’s Hlhbmkl model. Well used three data sets to compare the different tools in terms of in-sample and out-of-sample fit. The speed of the different systems varied quite a bit, with CBC/HB being significantly faster than the other methods. Both the in-sample and out-of-sample fit were strongly affected by the tuning of the priors (the amount of shrinkage permitted). Tools other than CBC/HB offer some more advanced diagnostics and model specifications, including Gelman diagnostics for convergence and respondent covariates in the upper-level model.

Respondent Weighting in HB (John Howell, Sawtooth Software): When samples include subgroups that have been oversampled, it has been reported that this can pose problems for proper HB estimation within CBC/HB software (which assumes a single, normally-distributed population). John investigated the degree to which this is a problem, and potential solutions. Using simulated data, John demonstrated that when subgroups are dramatically oversampled, the means of smaller groups shrink disproportionately toward the larger groups. This biases the sample means for the under-represented groups and harms the accuracy of market simulations. John found that much of the problem is due to diverging scale factors between smaller and larger subgroups: the scale for the oversampled groups is expanded, leading to a stronger pull on the overall sample mean. John found that normalizing the scale post hoc can largely control this issue. He also found that implementing a simple weighting algorithm within HB (computing a weighted alpha vector) can potentially improve matters further when there are extreme differences in sample sizes between subgroups. John suggested that other methods he didn’t investigate may improve estimation when some groups are oversampled, including models that estimate individual-level scale factors, models that involve less shrinkage (Student-t prior), or models that utilize multiple upper-level models. He concluded that regardless of how the shrinkage problem is solved, models should be tuned for scale at either the individual or group level.
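
One simple form of post-hoc scale normalization (our interpretation of the general idea, not necessarily John's adjustment) is to measure each group's average utility magnitude and rescale respondents so the groups share a common target.

```python
import numpy as np

# Hypothetical post-hoc scale normalization across subgroups: measure each
# group's average root-mean-square part-worth magnitude and rescale every
# respondent in that group toward a common target magnitude.

rng = np.random.default_rng(3)
utils = np.vstack([rng.normal(scale=1.0, size=(600, 8)),     # oversampled group, larger scale
                   rng.normal(scale=0.6, size=(100, 8))])    # small group, compressed scale
group = np.array([0] * 600 + [1] * 100)

rms = np.sqrt((utils ** 2).mean(axis=1))                     # respondent-level scale
group_scale = np.array([rms[group == g].mean() for g in (0, 1)])
target = group_scale.mean()                                  # common target magnitude

utils_normalized = utils * (target / group_scale[group])[:, None]
for g in (0, 1):
    new_rms = np.sqrt((utils_normalized[group == g] ** 2).mean(axis=1)).mean()
    print("group", g, "mean RMS after normalization:", round(new_rms, 3))
```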