Sawtooth Software: The Survey Software of Choice

Conference 2001: Summary of Findings

Nearly twenty presentations were given at our most recent Sawtooth Software Conference in Victoria, BC. We've summarized some of the high points below. Since we cannot possibly convey the full worth of the papers in a few paragraphs, we are making the complete written papers (not the presentation slides) available in the 2001 Sawtooth Software Conference Proceedings. If you haven't yet ordered your copy, please consider adding this valuable reference to your shelf. Call us at 360/681-2300 to order.

Knowledge as Our Discipline (Chuck Chakrapani): Chuck observed that many market researchers have become simply "order takers" rather than exerting real influence within their organizations. He claimed that very early on marketing research made the mistake of defining its role too narrowly. Broadening its scope of influence includes helping managers ask the right questions and becoming more knowledgeable about the businesses that market researchers serve. As evidence of the poor state of marketing research, Chuck showed how many management and marketing texts virtually ignore the marketing research function as important to the business process.

Chuck argued that, unlike other sciences, market research has not collectively developed a core body of knowledge about the law-like relationships within its discipline. The reasons include a lack of immediate rewards for compiling such knowledge and an over-concern for confidentiality within organizations. Chuck decried "black-box" approaches to market research: because their details are confidential, their validity cannot be truly challenged or tested. He argued that the widespread practice of "Sonking" (Scientification of Non-Knowledge), in the form of sophisticated-looking statistical models devoid of substantial empirical content, has obscured true fact-finding and ultimately lessened market researchers' value and influence.

Paradata: A Tool for Quality in Internet Interviewing (Ray Poynter and Deb Duncan): Ray and Deb showed how Paradata (information about the process) can be used for fine-tuning and improving on-line research. Time to complete the interview, number of abandoned interviews at each question in the survey, and internal fit statistics are all examples of Paradata. The authors reported that complex "grid" style questions, constant-sum questions, and open-end questions that required respondents to type a certain number of characters resulted in many more drop-outs within on-line surveys.
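As an illustration of how such paradata might be tallied, the short sketch below (the log layout, column names, and figures are invented for illustration) counts abandonments by question and computes a simple completion-time summary:

```python
import pandas as pd

# Hypothetical paradata log: one row per respondent, with the index of the
# last question answered and total time elapsed in minutes.
log = pd.DataFrame({
    "respondent": [1, 2, 3, 4, 5, 6],
    "last_question": [12, 47, 47, 8, 47, 30],   # 47 = completed the survey
    "minutes_elapsed": [4.2, 11.5, 9.8, 2.1, 14.0, 7.3],
})

TOTAL_QUESTIONS = 47

# Drop-out count at each question: a respondent whose last answer fell short
# of the end of the survey abandoned at last_question + 1.
dropouts = (
    log.loc[log["last_question"] < TOTAL_QUESTIONS, "last_question"]
       .add(1)
       .value_counts()
       .sort_index()
)
print("Abandonments by question:\n", dropouts)

# Median completion time among finishers -- another simple piece of paradata.
finishers = log[log["last_question"] == TOTAL_QUESTIONS]
print("Median minutes to complete:", finishers["minutes_elapsed"].median())
```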

In addition to observing the respondent's answers, the authors pointed out that much information can be learned by "asking" the respondent's browser questions. Ray and Deb called this "invisible" data. Examples include current screen resolution, browser version, operating system, and whether Java is enabled. Finally, the authors suggested that researchers pay close attention to privacy issues by posting privacy policies on their sites and faithfully abiding by them.

Web Interviewing: Where Are We in 2001? (Craig King and Patrick Delana): Craig and Patrick reported their experiences with Web interviewing (more than 230,000 interviews over the last two years). Most of their research has involved employee interviews at large companies, for which they report about a 50% response rate. The authors have also been involved in more traditional market research studies, for which they often find Web response rates of about 20% after successful qualification by a phone screener, but less than 5% for "targeted" lists of IT professionals. They suggested that the best way to improve response rates is to give cash to each respondent, though they noted that this is more expensive to process than cash drawings.

The authors reported findings from other research suggesting that paper-and-pencil and Web interviews usually produce quite similar findings. They reported on a split-sample study they conducted which demonstrated virtually no difference between the response patterns to a 47-question battery of satisfaction questions for Web and conventional mail surveys. The authors also shared some nuts-and-bolts advice, such as being careful about asking respondents to type alphanumeric passwords containing easily confused characters: 0 vs. o vs. O, "vv" vs. "w", or the numeral "1" vs. a lowercase "L".
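One practical way to follow that advice is to build respondent passwords only from characters that cannot be mistaken for one another. The sketch below is purely illustrative (the character set and password length are assumptions, not the authors' recommendation):

```python
import secrets

# Alphanumeric characters with look-alikes removed: no 0/O/o, no 1/l/I/i,
# and no v/V/w/W (a "vv" can read as a "w" in some fonts).
SAFE_CHARS = "23456789abcdefghjkmnpqrstuxyzABCDEFGHJKMNPQRSTUXYZ"

def make_password(length: int = 8) -> str:
    """Return a random respondent password built only from unambiguous characters."""
    return "".join(secrets.choice(SAFE_CHARS) for _ in range(length))

print(make_password())  # e.g. 'g7RkP3xQ'
```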

Defending Dominant Share: Using Market Segmentation and Customer Retention Modeling to Maintain Market Leadership (Mike Mulhern): Mike provided a case study demonstrating how segmentation followed by customer retention modeling could help a firm maintain market leadership. One of Mike's most important points was the decision to use intention to re-purchase, rather than customer satisfaction, as the dependent variable. He argued that intention to re-purchase is more closely linked to behavior than satisfaction is.

Mike described the process he used to build the retention models for each segment. After selecting logistic regression as his primary modeling tool, Mike discussed how he evaluated and improved the models. In this research, improvements to the models were made by managing multicollinearity with factor analysis, recoding the dependent variable to ensure variation, testing the independent variables for construct validity, and employing regression diagnostics. The diagnostic measures improved the model by identifying outliers and cases that had excessive influence. Examples from the research were used to illustrate how these diagnostic measures helped improve model quality.
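To make that workflow concrete, here is a minimal sketch of this kind of retention model--logistic regression with a multicollinearity check and a simple residual-based outlier flag--using simulated data and invented variable names rather than Mike's actual measures:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical data: repurchase = 1 if the customer intends to re-purchase.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "service_quality": rng.normal(size=300),
    "price_perception": rng.normal(size=300),
    "relationship_strength": rng.normal(size=300),
})
p_true = 1 / (1 + np.exp(-(0.8 * df["service_quality"] + 0.5 * df["relationship_strength"])))
df["repurchase"] = rng.binomial(1, p_true.to_numpy())

X = sm.add_constant(df[["service_quality", "price_perception", "relationship_strength"]])
y = df["repurchase"]

# Multicollinearity check: variance inflation factors for each predictor.
vif = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]
print("VIFs:", dict(zip(X.columns[1:], np.round(vif, 2))))

# Fit the retention (re-purchase) model by logistic regression.
model = sm.Logit(y, X).fit(disp=False)
print(model.summary())

# Simple diagnostic: Pearson residuals flag poorly fit, potentially influential cases.
p = model.predict(X)
pearson = (y - p) / np.sqrt(p * (1 - p))
print("Cases with |Pearson residual| > 2:", np.where(np.abs(pearson) > 2)[0])
```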

ACA/CVA in Japan: An Exploration of the Data in a Cultural Framework (Brent Soo Hoo, Nakaba Matsushima, and Kiyoshi Fukai): Brent and his co-authors cautioned researchers to pay attention to cultural differences before using conjoint analysis across countries. As one example, they pointed out characteristics of Japanese respondents that are unique to that country and might affect conjoint results, such as a reluctance to be outspoken (using the center rather than the extreme points of rating scales), which might result in lower quality conjoint data.

They tested the hypothesis that Japanese respondents tend to use the center portion of the 9-point graded comparison scale in ACA and CVA. They found at least some evidence of this behavior, but no proof that the resulting ACA utilities were less valid than those from countries whose respondents use more of the scale's full breadth.

A Methodological Study to Compare ACA Web and ACA Windows Interviewing (Gary Baker and Tom Pilon): Gary and Tom undertook a pilot research study among 120 college students to test whether the results of two new software systems (ACA for Windows and ACA for Web) were equivalent. They configured the two computerized interviews to look nearly identical (fonts, colors, and scales) for the self-explicated priors section and the pairs section. The ACA for Windows interview provided greater flexibility in the design of its calibration concept questions, so a new slider scale was tested.

Gary and Tom found no substantial differences between the utilities from the two approaches, suggesting that researchers can employ mixed-modality studies with ACA (Web/Windows) and simply combine the results. Respondents were equally comfortable with either survey method, and took about the same time to complete them. The authors suggested that respondents more comfortable completing Web surveys could be given a Web-based interview, whereas others might be sent a disk in the mail, be invited to a central site, or be visited by an interviewer carrying a laptop.

As seen in many other studies, HB improved the results over traditional ACA utility estimation. Other tentative findings were as follows: self-explicated utilities alone did quite well in predicting individuals' choices in holdout tasks--but the addition of pairs and HB estimation further improved the predictive accuracy of the utilities; and the calibration concept questions can be skipped if the researcher uses HB and does not need to run purchase likelihood simulations.

Increasing the Value of Choice-Based Conjoint with "Build Your Own" Questions (David Bakken): David showed how computerized questionnaires can include a "Build Your Own" (BYO) product question. In the BYO question, respondents configure the product that they are most likely to buy by choosing a level from each attribute. Each level is associated with an incremental price, and the total price is re-calculated each time a new feature is selected. Even though clients tend to like BYO questions a great deal, David suggested that the actual data from the BYO task may be of limited value.
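Mechanically, a BYO question is just a running total over the selected levels. A minimal sketch of that pricing logic, with invented attributes and prices:

```python
# Incremental price for each level of each attribute (hypothetical values).
BYO_PRICES = {
    "processor": {"standard": 0, "fast": 150, "fastest": 300},
    "warranty":  {"1 year": 0, "3 years": 80},
    "screen":    {"15 inch": 0, "17 inch": 120},
}
BASE_PRICE = 499

def total_price(selections: dict) -> int:
    """Re-compute the displayed price each time the respondent changes a feature."""
    return BASE_PRICE + sum(BYO_PRICES[attr][level] for attr, level in selections.items())

print(total_price({"processor": "fast", "warranty": "3 years", "screen": "15 inch"}))  # 729
```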

David presented the results of a study that compared traditional Choice-Based Conjoint results to BYO questions. He found only a loose relationship between the results of the two methods. He concluded that a BYO question may serve a good purpose for product categories in which buyers truly purchase the product in a BYO fashion, but that larger sample sizes than for traditional conjoint are needed. Furthermore, experimental treatments (e.g. variations in price for each feature) might be needed either within or between subjects to improve the value of the BYO task, and between-subjects designs would increase sample size demands further. David pointed out that the BYO task focuses respondents on trading off each feature against price rather than trading features off against one another. The single trade-off versus price may reflect a different cognitive process than the multi-attribute trade-off that characterizes a choice experiment.

Applied Pricing Research (Jay Weiner): Jay reviewed the common approaches to pricing research: willingness-to-pay questions, monadic designs, the van Westendorp technique, conjoint analysis, and discrete choice. Jay argued that most products exhibit a range of inelasticity--and finding that range is one of the main goals of pricing research. Demand may fall as price rises, but total revenue can increase over those limited ranges.
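A quick numeric illustration of that point, using made-up demand figures: while demand is inelastic, the percentage drop in volume is smaller than the percentage increase in price, so revenue rises.

```python
# Hypothetical demand at each candidate price point.
prices = [1.00, 1.10, 1.20, 1.30]
units  = [1000,  970,  930,  820]   # demand falls as price rises

for p, q in zip(prices, units):
    print(f"price {p:.2f}  units {q}  revenue {p * q:,.0f}")

# Revenue rises from 1,000 to 1,067 to 1,116 over the inelastic range,
# then drops to 1,066 once demand turns elastic at 1.30.
```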

Jay compared the results of monadic concept tests and the van Westendorp technique. He concluded that the van Westendorp technique did a reasonable job of predicting actual trial for a number of FMCG categories. Even though he didn't present data on the subject, he suggested that the fact that CBC offers a competitive context may improve the results relative to other pricing methods.

Reliability and Comparability of Choice-Based Measures: Online and Paper-and-Pencil Methods of Administration (Tom Miller, David Rake, Takashi Sumimoto, and Peggy Hollman): Tom and his co-authors presented evidence that the usage of on-line surveys is expected to grow significantly in the near future. They also pointed out that some studies, particularly comparing web interviewing with telephone research, show that different methods of interviewing respondents may yield different results. These differences may be partly due to social desirability issues, since telephone respondents are communicating with a human rather than a computer.

Tom and his co-authors reported on a carefully designed split-sample study that compared the reliability of online and paper-and-pencil discrete choice analysis. Student respondents from the University of Wisconsin were divided into eight design cells. Respondents completed both paper-and-pencil and CBC tasks, in different orders. The CBC interview employed a fixed design in which respondents saw each task twice, permitting a test-retest condition for each task. The authors found no significant differences between paper-and-pencil administration and on-line CBC. Tom and his colleagues concluded that for populations in which respondents were comfortable with on-line technology, either method should produce equivalent results.

Trade-Off Study Sample Size: How Low Can We Go? (Dick McCullough): In market research, the decision regarding sample size is often one of the thorniest. Clients have a certain budget and often a sample size in mind based on past experience, and different conjoint analysis methods provide varying degrees of precision for a given sample size. Dick compared the stability of conjoint information as sample size is reduced. He compared Adaptive Conjoint Analysis (ACA), traditional ratings-based conjoint (CVA), and Choice-Based Conjoint (CBC), using both traditional and hierarchical Bayes estimation. Dick used actual data sets with quite large sample sizes (N>400). He randomly chose subsets of the sample for analysis and compared the results each time to the full sample. The criteria for fit were how well the utilities from the sub-sample matched the utilities for the entire sample, and how well market simulations for hypothetical scenarios run on the sub-sample matched those run on the entire sample.
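A simplified sketch of this kind of stability check appears below: random subsamples of respondents are drawn repeatedly and their average utilities compared to the full-sample averages. The utility matrix is simulated, and correlation stands in for whichever fit criteria Dick actually used.

```python
import numpy as np

# Hypothetical respondent-by-parameter matrix of estimated part-worth utilities.
rng = np.random.default_rng(1)
utilities = rng.normal(size=(400, 12))          # 400 respondents, 12 part-worths
full_sample_means = utilities.mean(axis=0)

def subsample_agreement(n: int, draws: int = 200) -> float:
    """Average correlation between subsample mean utilities and full-sample means."""
    corrs = []
    for _ in range(draws):
        idx = rng.choice(len(utilities), size=n, replace=False)
        sub_means = utilities[idx].mean(axis=0)
        corrs.append(np.corrcoef(sub_means, full_sample_means)[0, 1])
    return float(np.mean(corrs))

for n in (30, 60, 120, 240):
    print(n, round(subsample_agreement(n), 3))
```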

Because the data sets were not specifically designed for this research, Dick faced challenges in drawing firm conclusions about the differences in conjoint approaches and sample size. Despite the limitations, Dick's research suggests that ACA data are more stable than CBC data (given the same sample size). His findings also suggest that conjoint researchers may be able to significantly reduce sample sizes without great losses in information. Especially for preliminary exploratory research, sample sizes as small as 30 or even fewer may yield valid insights into the population of interest. In the discussion following the presentation, Greg Allenby of Ohio State (considered the foremost expert in applying HB to marketing research problems) suggested that HB should work better than traditional estimation even with extremely small samples--even sample sizes of fewer than 10 people.

Disaggregation with Partial-Profile Experiments (Jon Pinnell and Lisa Fridley): Jon and Lisa's research picked up where Jon's previous Sawtooth Software Conference paper (from 2000) had left off. In the 2000 conference, Jon examined six commercial CBC data sets and found that Hierarchical Bayes (HB) estimation almost universally improved the accuracy of individual-level predictions for holdout choice tasks relative to aggregate main-effects logit. The one exception was a partial-profile choice experiment in which respondents only saw a subset of the total number of attributes within each choice task. Jon and Lisa decided this year to focus the investigation on just partial-profile choice data sets to see if that finding would generalize.

After studying nine commercial partial-profile data sets, Jon found that for four of the data sets simple aggregate logit utilities fit individual holdout choices better than individual estimates under HB. Jon could not conclusively determine which factors caused this, but he surmised that the following may hurt HB's performance with partial-profile CBC data sets: 1) low heterogeneity among respondents, and 2) a large number of parameters to be estimated relative to the amount of information available at the individual level. Specifically related to point 2, Jon noted that experiments with few choice concepts per task performed worse under HB than experiments with more concepts per task. Later discussion by Keith Sentis suggested that the inability to obtain good estimates at the individual level may be exacerbated as the ratio of attributes shown per task to total attributes in the design becomes smaller. Jon also suggested that the larger scale parameter previously reported for partial-profile data sets relative to full-profile data might in part be due to overfitting, rather than a true reduction in noise for the partial-profile data.

One-Size-Fits-All or Custom Tailored: Which HB Fits Better? (Keith Sentis and Lihua Li): Keith began his presentation by describing a concern he has had over the last few years about Sawtooth Software's HB software: its assumption of a single multivariate normal distribution to describe the population. Keith and Lihua wondered whether that assumption degrades the estimated utilities when segments with quite different utilities exist.

The authors studied seven actual CBC data sets, systematically excluding some of the tasks to serve as holdouts for internal validation. They estimated the utilities in four ways: 1) by using the entire sample within the same HB estimation routine, 2) by segmenting respondents according to industry sectors and estimating HB utilities within each segment, 3) by segmenting respondents using a K-means clustering procedure on HB utilities and then re-estimating within each segment using HB, and 4) by segmenting respondents using Latent Class and then estimating HB utilities within each segment.
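As an illustration of the third approach only, the sketch below clusters hypothetical individual-level HB utilities with K-means; the per-segment HB re-estimation that would follow is not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical individual-level HB utilities: one row per respondent.
rng = np.random.default_rng(2)
hb_utilities = rng.normal(size=(500, 15))

# Approach 3: K-means on the HB utilities to form segments ...
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(hb_utilities)
segments = kmeans.labels_

# ... after which each segment's respondents would be re-estimated
# separately with HB (that step is omitted here).
for seg in range(4):
    print(f"segment {seg}: {np.sum(segments == seg)} respondents")
```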

Keith and Lihua found that whether one ran HB on the entire sample or segmented first before estimating utilities, HB's upper-level assumption of normality did not decrease the fit of the estimated utilities to the holdouts. It seemed unnecessary to segment before running HB. In his discussion of Keith's paper, Rich Johnson suggested that Keith's research supports the notion that clean segmentation may not be present in most data sets. Subsequent discussion highlighted that there seemed to be enough data at the individual level (each respondent usually received about 14 to 20 choice tasks) that respondents' utilities could be fit reasonably well to their own data while being only moderately tempered by the assumption of a multivariate normal population distribution. Greg Allenby chimed in that Keith's findings were not a surprise to him: he has found that extending HB to accommodate multiple distributions leads to only minimal gains in predictive accuracy.

Modeling Constant Sum Dependent Variables with Multinomial Logit: A Comparison of Four Methods (Keith Chrzan and Sharon Alberg): Keith and Sharon used aggregate multinomial logit to analyze three constant sum CBC data sets under different coding procedures. In typical CBC data sets, respondents choose just one favored concept from a set of concepts. With constant sum (allocation) data, respondents allocate, say, 10 points among the alternatives to express their relative preferences/probabilities of choice. The first approach the authors tested was to simply convert the allocations to a discrete choice (winner takes all for the best alternative). Another approach coded the 10-point allocation as if it were 10 independent discrete choice events: a separate winner-take-all task for each alternative receiving points, with each task weighted by the points allocated to it. Keith noted that this is the method used by Sawtooth Software's HB-Sum software. Another approach made the allocation task look like a series of interrelated choice sets, the first showing that the alternative with the most "points" was preferred to all others, the second showing that the second most preferred alternative was preferred to the remaining concepts (not including the "first choice"), and so on. The last approach was the same as the previous one, but with a weight given to each task equal to the allocation for the chosen concept.
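To illustrate just one of these codings--the allocation-weighted expansion Keith associated with HB-Sum--the sketch below turns a single 10-point allocation into weighted choice observations (the concept names and point values are invented):

```python
# One allocation task: 10 points spread across three concepts (hypothetical data).
allocation = {"concept_A": 6, "concept_B": 3, "concept_C": 1}

# Expand into weighted discrete-choice observations: each alternative that
# received points is treated as the "chosen" concept in its own task, and the
# observation is weighted by the number of points it received.
expanded = [
    {"chosen": concept, "weight": points, "alternatives": list(allocation)}
    for concept, points in allocation.items()
    if points > 0
]

for obs in expanded:
    print(obs)
# These weighted observations can then be stacked across respondents and tasks
# and analyzed with a multinomial logit routine that accepts case weights.
```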

Using the Swait-Louviere test for equivalence of parameters and scale, Keith and Sharon found that the different models were equivalent in their parameters for all three data sets, but not equivalent for scale for one of the data sets. Keith noted that the difference in scale could indeed affect the results of choice simulations. He suggested that for logit simulations this difference was of little concern, since the researcher would likely adjust for scale to best fit holdouts anyway. Keith concluded that it was comforting that different methods provided quite similar results and recommended the coding strategy as used with HB-Sum, as it did not discard information and seemed to be the easiest for his group to program whether using SAS, SPSS, LIMDEP or LOGIT.

Appropriate Analysis of Constant-Sum Data (Joel Huber and Eric Bradlow): Joel and Eric described how the cognitive process differs between standard choice and constant-sum (allocation) data. Joel stated that choosing one from among different alternatives is a simple task for respondents. He argued that allocations are much more difficult and may frustrate or confuse some respondents. But with certain purchases, like soft drinks, breakfast cereals, and prescriptions given a diagnosis, an allocation may still make sense. Another method of allocation, which the authors used within their case-study analysis, was a volumetric allocation, in which the respondent is not required to make the allocated units sum to any specific total.

Joel and Eric then described a two-stage approach for analyzing volumetric choice data. The first stage was to convert the allocations to normalized probabilities of choice within sets and estimate the utilities under Hierarchical Bayes (HB) using Sawtooth Software's HB-Sum system. The second step employed Sawtooth Software's HB-Reg system. Volume for each alternative was modeled as a function of a constant for each respondent, a second variable reflecting the utility of each item as estimated by HB-Sum, and a third variable which was the utility of the other items within the current set as estimated by HB-Sum. Joel and Eric found acceptable fit (correlation 0.73) between the predicted allocation volumes and the actual volumes. They concluded that HB methods are critical for correct modeling, because of the heterogeneity in the ways people respond to the volumetric allocation task.
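The second-stage design can be sketched as follows; ordinary least squares is used here as a simple stand-in for HB-Reg (which additionally lets the intercept and slopes vary by respondent), and all data are simulated:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical stacked data: one row per (respondent, set, item), where
# own_utility is the item's HB-Sum utility for that respondent and
# other_utility is the utility of the other items in the same set.
rng = np.random.default_rng(3)
n = 600
df = pd.DataFrame({
    "own_utility": rng.normal(size=n),
    "other_utility": rng.normal(size=n),
})
df["volume"] = 2.0 + 1.5 * df["own_utility"] - 0.6 * df["other_utility"] + rng.normal(scale=0.5, size=n)

# Pooled OLS stand-in for the second-stage volume model.
X = sm.add_constant(df[["own_utility", "other_utility"]])
fit = sm.OLS(df["volume"], X).fit()
print(fit.params)
print("Correlation of predicted vs. actual volume:",
      round(np.corrcoef(fit.fittedvalues, df["volume"])[0, 1], 2))
```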

Modeling the No-Choice Alternative in Conjoint Choice Experiments (Rinus Haaijer, Michel Wedel, and Wagner Kamakura): Rinus and his coauthors addressed the pros and cons of including No-Choice (None, or Constant Alternatives) in choice tasks. The advantages of the None alternative are that it makes the choice situation more realistic, it might be used as a proxy for market penetration, and it promotes a common scaling of utilities across choice tasks. The disadvantages are that it provides an "escape" option for respondents to use when a choice seems difficult, less information is provided by a None choice than a choice of another alternative, and potential IIA violations may result when modeling the None.

Rinus provided evidence that some strategies that have been reported in the literature for coding the None within Choice-Based Conjoint can lead to biased parameters and poor model fit--particularly if some attributes are linearly coded. He found that the None alternative should be explicitly accounted for as a separate dummy code (or as one of the coded alternatives of an attribute) rather than just be left as the "zero-state" of all columns. The coding strategy that Rinus and co-authors validated is the same that has been used within Sawtooth Software's CBC software for nearly a decade.
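The coding point can be seen in a tiny example design matrix: the None concept receives its own explicit dummy column rather than being left as a row of zeros (the attributes and values below are invented):

```python
import numpy as np

# One choice task with two product concepts plus a None alternative.
# Columns: [brand_B dummy, price (coded linearly, in dollars), NONE dummy]
design = np.array([
    [0, 10.0, 0],   # concept 1: brand A at $10
    [1, 12.0, 0],   # concept 2: brand B at $12
    [0,  0.0, 1],   # None alternative: its own dummy, not just a row of zeros
])

# Without the NONE column, the None row would be all zeros and its utility
# would be forced to equal that of a "zero-priced brand A" product -- the kind
# of distortion the authors warn about when attributes are linearly coded.
print(design)
```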

The Historic Evolution of Conjoint (Rich Johnson and Joel Huber): Joel began this double-length presentation by describing how the mathematical psychologist Duncan Luce and the statistician John Tukey first developed the theory behind conjoint analysis in the early 1960s. Luce and Tukey put forward a system of conjoint measurement, in which rank-order judgments on conjoined items (attribute levels or objects) could be converted to interval-scaled utilities. This was a breakthrough because interval-scaled utilities can support mathematical operations such as addition and subtraction, whereas rank-order data cannot.

In the early 1970s Paul Green recognized that Luce and Tukey's work on conjoint measurement might be applied to marketing problems. He introduced the idea of full-profile card-sort conjoint to the marketing community, effective for problems involving about six attributes or fewer. Paul shifted the focus from full factorial arrays to fractional orthogonal plans, from ordinal estimation to linear estimation, and from an emphasis on tests to a focus on market simulations. Indeed, Rich pronounced, Paul Green must be considered the father of conjoint analysis.

About the same time, Rich Johnson was developing a system of pairwise trade-off matrices for estimating utilities for industry-driven problems having well over 20 attributes. Rich was unaware of the work of Paul Green and colleagues, which he noted would have been of immense help; practitioners like himself had much less interaction with academics than they do today. Rich discovered that trade-off matrices worked fairly well, but they were difficult for respondents to complete reliably. About that same time, small computers were being developed, and Rich recognized that these might be used to administer trade-off questions. He also realized that if respondents were asked to provide initial rank-orders within attributes, many of the trade-offs could be assumed rather than explicitly asked, so the same information could be obtained for main-effects estimation with many fewer questions. These developments marked the beginning of what became the computerized conjoint method Adaptive Conjoint Analysis (ACA).

Joel finished the presentation by describing the emergence of Choice-Based Conjoint (discrete choice) methods. He pointed to early work in the 1970s by McFadden, which laid the groundwork for multinomial logit. A later paper by Louviere and Woodworth (1983) kicked off Choice-Based Conjoint within the marketing profession. Propelled by the recent boost offered by individual-level HB estimation, Joel predicted that Choice-Based Conjoint would eventually overtake Adaptive Conjoint Analysis as the most widely used conjoint-related method.

Recommendations for Model Validation (Terry Elrod): Terry criticized two common practices for validating conjoint models and proposed a remedy for each. First, he criticized using hit rates to identify the better of several models because they discard too much information. Hit rate calculations consider only which choice was predicted as being most likely by a model and ignore the predicted probability of that choice. He pointed out that the likelihood criterion, which is used to estimate models, is easily calculated for holdout choices. He prefers this measure because it uses all available information to determine which model is best. It is also more valid than hit rates because it penalizes models for inaccurate predictions of aggregate shares.
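A small sketch of the contrast between the two measures, using invented holdout predictions:

```python
import numpy as np

# Predicted choice probabilities for four holdout tasks (rows) with three
# alternatives each, plus the alternative actually chosen (hypothetical data).
probs = np.array([
    [0.50, 0.30, 0.20],
    [0.40, 0.35, 0.25],
    [0.10, 0.60, 0.30],
    [0.34, 0.33, 0.33],
])
chosen = np.array([0, 1, 1, 2])

# Hit rate: only asks whether the highest-probability alternative was chosen.
hit_rate = np.mean(probs.argmax(axis=1) == chosen)

# Holdout log-likelihood: uses the predicted probability of each actual choice,
# so confident wrong predictions and lukewarm right ones are both penalized.
log_likelihood = np.sum(np.log(probs[np.arange(len(chosen)), chosen]))

print("hit rate:", hit_rate)                 # 0.5 -- the second and fourth tasks are misses
print("log-likelihood:", round(log_likelihood, 3))
```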

Second, Terry warned against the common practice of using the same respondents for utility estimation and validation. He showed that this practice artificially favors utility estimation techniques that overfit respondent heterogeneity. For example, it understates the true superiority of hierarchical Bayes estimation (which attenuates respondent heterogeneity) relative to individual-level estimation. He suggested a four-fold holdout procedure as a proper and practical alternative. This approach involves estimating a model four times, each time using a different one-fourth of the respondents as holdouts and the other three-fourths for estimation. A model's validation score is simply the product of the four holdout likelihoods.
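A sketch of that four-fold scheme appears below; the estimation and likelihood functions are placeholders, since the actual model would be whichever one the researcher is validating.

```python
import numpy as np
from sklearn.model_selection import KFold

def estimate_model(train_ids):
    """Placeholder: estimate the conjoint model on these respondents' data."""
    return {"trained_on": train_ids}

def holdout_likelihood(model, holdout_ids):
    """Placeholder: likelihood of the holdout respondents' choices under the model."""
    return np.exp(-0.01 * len(holdout_ids))   # dummy value for illustration

respondent_ids = np.arange(400)
kf = KFold(n_splits=4, shuffle=True, random_state=0)

validation_score = 1.0
for train_idx, holdout_idx in kf.split(respondent_ids):
    model = estimate_model(respondent_ids[train_idx])                            # three-fourths of respondents
    validation_score *= holdout_likelihood(model, respondent_ids[holdout_idx])   # one-fourth held out

print("Four-fold validation score (product of holdout likelihoods):", validation_score)
```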

Basis Variables for Brand-Related Segmentation (Greg Allenby, Geraldine Fennell, Sha Yang, Yancy Edwards): Greg presented results from a commercial data set that asked respondents about 50 product categories, including brand preference and various segmentation variables. He and his co-authors found that the common demographics used in market research are poor predictors of brand preference. He pointed out that demographics are not useless--they are often fair predictors of category usage (e.g. whether a household buys diapers or luxury cars) and of usage volume (heavy vs. light). He suggested that market researchers may find much stronger linkages to brand preference in contexts, situations, and underlying motivating factors. As one example, Greg presented a battery of questions related to the concerns, interests, and motivations surrounding tooth brushing that in turn might offer a tighter linkage to brand preference.