
Highlights from Sawtooth Software Conference 2012

The 2012 Sawtooth Software conference drew about 260 people to the Boardwalk Hotel in Orlando, Florida. That’s a record turnout, and we’re very grateful for the support and enthusiasm of all involved.

The conference largely focused on conjoint analysis and MaxDiff methodologies, though there were also interesting presentations on two additional topics: text mining and cluster ensemble analysis.

It’s truly challenging to pick a few highlights to discuss in this short article, but there are some topics we at Sawtooth Software found particularly relevant to our work. In mentioning these, we risk offending some speakers who delivered valuable findings at the conference. We apologize, and invite everyone to read the full 2012 Conference Proceedings, which should become available for download from our website in the next few months.

Game Theory and Conjoint Analysis

Chris Chapman (formerly of Microsoft and now at Google) challenged the audience to work with internal and external clients to define the business problem in terms of strategic product positioning or pricing moves, and the potential responses of competitors. Then, rather than presenting utilities and importance scores to the client, the researcher should focus on what happens to share of preference (revenues or profits) when the client makes a move and competitors react. Expert opinion about the likelihood of different competitor reactions should be incorporated. By focusing on such competitive actions and reactions, the results of conjoint analysis become much more relevant to strategic decisions and the firm’s bottom line.
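To make the idea concrete, here is a minimal sketch of this kind of simulation in Python, assuming respondent-level utilities for each product concept have already been estimated from a conjoint study. The utilities, competitor-reaction scenarios, and probabilities below are entirely hypothetical placeholders; the sketch only illustrates weighting share-of-preference results by expert-judged likelihoods of competitor reactions.

    import numpy as np

    def share_of_preference(utilities):
        """Logit (share of preference) shares, averaged across respondents."""
        expu = np.exp(utilities)
        shares = expu / expu.sum(axis=1, keepdims=True)
        return shares.mean(axis=0)

    # Hypothetical respondent-by-product utilities; in practice these would come
    # from an estimated conjoint/CBC model for specific product configurations.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(500, 3))  # 500 respondents x (our offer, competitor A, competitor B)

    # Each scenario: utilities after a possible competitor reaction, plus an
    # expert-judged probability that the reaction occurs (probabilities sum to 1).
    scenarios = [
        ("no reaction",             base,                             0.5),
        ("competitor A cuts price",  base + np.array([0.0, 0.4, 0.0]), 0.3),
        ("competitor B relaunches",  base + np.array([0.0, 0.0, 0.6]), 0.2),
    ]

    expected_share = sum(p * share_of_preference(u)[0] for _, u, p in scenarios)
    print(f"Expected share of preference for our move: {expected_share:.3f}")

The same loop could just as easily report expected revenue or profit by weighting each product’s share by its price or margin.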

Anchored MaxDiff

Two presentations examined different ways of incorporating an anchor threshold of preference/importance within MaxDiff (Johnson and Fuller, and Horne et al.). For example, anchoring allows a respondent (or segment of respondents) who doesn’t like any of the items to have scores that demonstrate that lack of preference (relative to the anchor). A few interesting points came out of these presentations. The indirect approach proposed by Louviere leads to a significantly lower threshold than the direct approach, meaning that many more items will be deemed “important” or predicted to be “bought” compared to the direct approach. The direct approach as described by Lattery leads to what many researchers would consider a more realistic assessment of a buying threshold. Both approaches seem to provide useful anchoring from a statistical standpoint, but the direct method actually contains more complete information and is easier for respondents. That said, the direct approach has drawbacks when the number of items gets large, as the grids involving direct ratings of items can induce lazy rating behavior as well as context bias. And, as a big drawback, when anchoring is used in MaxDiff, the 800-lb gorilla of anchor label interpretation and scale use bias (which we had avoided with standard MaxDiff) comes back into the room. Horne et al. showed that the absolute utility of the anchor varied quite a bit by country. So, with all the interest surrounding anchored MaxDiff, this is a particularly worrisome drawback.
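As a rough illustration of why the anchor’s position matters, the sketch below counts how many item evaluations clear a lower “indirect-style” anchor versus a higher “direct-style” anchor. The simulated utilities and anchor positions are entirely hypothetical, not results from either paper; they only show how the same item scores lead to very different conclusions about what is “important.”

    import numpy as np

    rng = np.random.default_rng(1)
    # Hypothetical HB-style utilities: 300 respondents x 20 items (zero-centered),
    # plus a per-respondent anchor ("importance threshold") utility from each method.
    items = rng.normal(0.0, 1.0, size=(300, 20))
    anchor_indirect = rng.normal(-0.8, 0.3, size=(300, 1))  # lower threshold (illustrative)
    anchor_direct = rng.normal(0.2, 0.3, size=(300, 1))     # higher threshold (illustrative)

    def pct_important(item_utils, anchor_utils):
        """Share of item evaluations scoring above the respondent's anchor."""
        return (item_utils > anchor_utils).mean()

    print("Deemed important (indirect anchor):", round(pct_important(items, anchor_indirect), 2))
    print("Deemed important (direct anchor):  ", round(pct_important(items, anchor_direct), 2))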

MaxDiff with Large Lists of Items

Wirth and Wolfrath reviewed different approaches for conducting MaxDiff when the number of items gets especially large (such as 60 items or more). One approach is simply to create a sparse questionnaire, where each item appears just once per respondent. HB estimation, even under these sparse conditions, led to MaxDiff scores for the sample that seemed to be of high quality. Even the individual-level hit rates for holdouts seemed relatively good, despite the quite sparse information at the individual level. The authors also examined what happens if each respondent sees only a subset of the items, with each shown item appearing multiple times. This approach didn’t work quite as well as including as many items as possible (though sparsely) within each respondent’s questionnaire. Certainly, when dealing with especially sparse data conditions (such as 100+ items), aggregate logit may be used (given substantial sample size), and it isn’t necessary for each respondent to be exposed to all the items.
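A minimal sketch of the sparse idea, assuming 60 items shown in sets of five, so that each respondent sees every item exactly once. A production design would also control position balance and item co-occurrence across respondents; this only shows the structure of one respondent’s questionnaire.

    import numpy as np

    def sparse_maxdiff_design(n_items=60, items_per_set=5, rng=None):
        """One respondent's sparse questionnaire: every item appears exactly once,
        randomly partitioned into MaxDiff sets (60 items -> 12 sets of 5)."""
        if rng is None:
            rng = np.random.default_rng()
        order = rng.permutation(n_items)
        return order.reshape(-1, items_per_set)

    design = sparse_maxdiff_design(rng=np.random.default_rng(2))
    print(design.shape)   # (12, 5): 12 sets of 5 items for this respondent
    print(design[:2])     # item indices shown in the first two sets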

How Many Choice Tasks Are Needed?

This oft-studied topic was again discussed at the 2012 conference, by Tang and Grenville, and separately by Kurz and Binner. Tang and Grenville approached the topic from the perspective of how many tasks are needed if the researcher plans to use the CBC data to classify respondents into segments. They concluded that about 10 tasks are satisfactory in most situations. Kurz and Binner re-analyzed existing CBC data sets to see whether the last tasks that respondents completed were actually of any value to final predictions. Some respondents’ decision rules are so simple that about six tasks were all that were needed to nail down their preferences. For other respondents, the final choice tasks showed increased noise and actually did nothing to improve predictions. Kurz and Binner found that predictions of holdouts could actually be improved by throwing out the last tasks for many respondents (chosen according to rules assessing whether the additional tasks were providing value at the individual level). They speculated that if an interviewing system could detect that additional tasks were providing no value, then an adaptive algorithm could terminate the CBC portion of the interview early. This would not only save time (and data collection costs), but also lead to a better respondent experience with CBC surveys.
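Below is a hedged sketch of what such an adaptive stopping rule might look like; it is not Kurz and Binner’s actual criterion. It simply asks whether the utilities estimated from a respondent’s earlier tasks predict the most recent answers better than chance, and keeps asking tasks only while they do. The function names, data, and threshold are hypothetical.

    import numpy as np

    def task_loglik(beta, X_task, choice_idx):
        """Logit log-likelihood of the observed choice in one CBC task."""
        v = X_task @ beta
        return v[choice_idx] - np.log(np.exp(v).sum())

    def keep_collecting(beta_so_far, recent_tasks, noise_floor=np.log(1.0 / 3.0)):
        """Hypothetical stopping rule (not the authors' actual criterion): continue
        only while utilities estimated from earlier tasks predict the respondent's
        most recent answers better than chance (chance = 1/3 with three concepts)."""
        lls = [task_loglik(beta_so_far, X, c) for X, c in recent_tasks]
        return float(np.mean(lls)) > noise_floor   # True -> ask another task

    # Hypothetical usage: 3 concepts per task, 4 coded parameters.
    rng = np.random.default_rng(3)
    beta = rng.normal(size=4)                     # utilities estimated from earlier tasks
    recent = [(rng.normal(size=(3, 4)), int(rng.integers(3))) for _ in range(3)]
    print(keep_collecting(beta, recent))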

Design of CBC Experiments

Three presentations focused on the design of experiments (Kuhfeld/Wurst, Reed Johnson, and Huber). Huber pointed out that traditional measures for assessing the quality of designs (e.g., D-efficiency) have assumed that respondents answer CBC questionnaires according to the logit rule, and that increased task complexity does not lead to greater response error. Under these assumptions, increasing the difficulty of the choice tasks (by increasing the number of alternatives per task, the number of attributes that vary within a task, the utility balance, and also the number of tasks) always increases the D-efficiency. But these same actions lead to greater response error, and therefore (especially in the case of utility balance) often no better utility estimation in the end. Huber emphasized that task difficulty should be considered carefully when creating experimental designs. Many marketing-related CBC studies can be done quite easily with five or more alternatives per task, but in many healthcare-related CBC studies it’s extremely difficult for respondents to consider anything beyond pairs. Reed Johnson’s presentation focused on a meta-analysis of dozens of healthcare-related CBC studies. Reed found that increasing the sample size had a much larger effect on reducing error (in holdout task prediction) than modest increases in D-efficiency resulting from more money and time spent on design generation.
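For reference, here is a small sketch of the standard D-error calculation Huber was referring to, computed under exactly the assumptions he questioned (logit rule, no complexity-driven response error). The designs and parameter values are hypothetical; the point is only to show the calculation that makes larger, more complex tasks look more efficient under these assumptions.

    import numpy as np

    def d_error(design_tasks, beta):
        """D-error of a CBC design under the traditional assumptions: respondents
        follow the logit rule and response error does not grow with task complexity."""
        k = len(beta)
        info = np.zeros((k, k))
        for X in design_tasks:                       # X: alternatives x parameters
            v = X @ beta
            p = np.exp(v - v.max())
            p /= p.sum()                             # logit choice probabilities
            info += X.T @ (np.diag(p) - np.outer(p, p)) @ X
        return np.linalg.det(info) ** (-1.0 / k)

    # Hypothetical comparison: 10 tasks with 3 vs. 5 alternatives, 4 coded parameters.
    rng = np.random.default_rng(4)
    beta = np.zeros(4)                               # "no information" prior utilities
    tasks3 = [rng.choice([-1.0, 0.0, 1.0], size=(3, 4)) for _ in range(10)]
    tasks5 = [rng.choice([-1.0, 0.0, 1.0], size=(5, 4)) for _ in range(10)]
    print("D-error with 3 alternatives per task:", round(d_error(tasks3, beta), 3))
    print("D-error with 5 alternatives per task:", round(d_error(tasks5, beta), 3))

Nothing in this arithmetic accounts for the extra response error that a five-alternative task may induce, which is Huber’s central caution.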

Menu-Based Choice

A presentation by our colleagues at SKIM (Cordella et al.) found that an internal method they developed for analyzing MBC data led to results extremely similar to those from the approach taken in our MBC software. Both methods led to good predictions of holdout menu tasks, including the ability to predict combinatorial selections at the individual level.

Assessing Unacceptable Levels in Conjoint Analysis

A presentation by Lattery/Orme confirmed that different methods of asking respondents to declare unacceptable levels led to overstatement of the degree of unacceptability of those levels: respondents ended up selecting “unacceptable” levels later in the holdout tasks. This may be partly due to error in the holdout judgments, but more likely, when respondents state that a level is unacceptable, they are simply indicating that it is highly undesirable. The methods of analysis Lattery/Orme applied recognized “unacceptable” levels as inferior, but did not assume that concepts involving these levels would be rejected by respondents under all circumstances.
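In the spirit of that analysis (though not necessarily the authors’ exact model), one simple way to treat stated unacceptables as highly undesirable rather than as a hard constraint is to apply a large utility penalty in the simulator, as in this hypothetical sketch:

    import numpy as np

    def apply_unacceptable_penalty(concept_utils, has_unacceptable, penalty=3.0):
        """Treat stated 'unacceptable' levels as highly undesirable rather than as a
        hard constraint: concepts containing them take a large utility penalty but
        are not forced to zero share (illustrative, not the authors' exact model)."""
        return concept_utils - penalty * has_unacceptable

    # Hypothetical respondent: three concepts; the second contains a level the
    # respondent declared unacceptable.
    utils = np.array([1.2, 2.5, 0.4])
    flags = np.array([0.0, 1.0, 0.0])
    adjusted = apply_unacceptable_penalty(utils, flags)
    shares = np.exp(adjusted) / np.exp(adjusted).sum()
    print(shares.round(3))   # the flagged concept is heavily penalized but not impossible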