
On the Design of MaxDiff Experiments

MaxDiff (Best/Worst) scaling has attracted growing interest lately. Papers on this topic have won "best presentation" at our two most recent Sawtooth Software conferences. MaxDiff provides a trade-off based alternative to standard rating scales for evaluating the desirability or importance of items. The items may be product features, products themselves, political candidates, brand names, etc.

The basic idea behind MaxDiff is to present respondents with (typically) four to six items at a time and ask which item in the set is "best" and which is "worst." Respondents often complete a dozen or more such choice sets.

MaxDiff questioning yields more information per unit of respondent effort than the classic Method of Paired Comparisons. For example, if among items A, B, C, and D the respondent indicates that B is "best" and C is "worst," we can infer that B>A, B>C, B>D, A>C, and D>C. From these two selections within a single set, MaxDiff infers five of the six possible paired comparisons among four items. Indeed, MaxDiff has been shown to provide results superior to Paired Comparisons in a recent methodological test (see "Maximum Difference Scaling: Improved Measures of Importance and Preference for Segmentation" by Steve Cohen, available in our Technical Papers Library at www.sawtoothsoftware.com).
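
To make that counting argument concrete, the short Python sketch below (illustrative only; the names are ours, not part of any Sawtooth Software product) enumerates the paired comparisons implied by a single best/worst answer:

    from itertools import combinations

    def implied_pairs(items, best, worst):
        # The chosen "best" item beats every other shown item, and every
        # other shown item beats the chosen "worst" item. Only pairs among
        # the remaining items stay undetermined.
        pairs = set()
        for item in items:
            if item != best:
                pairs.add((best, item))      # best > item
            if item not in (best, worst):
                pairs.add((item, worst))     # item > worst
        return pairs

    # One set of four items: B picked as best, C picked as worst.
    print(sorted(implied_pairs(["A", "B", "C", "D"], best="B", worst="C")))
    # -> [('A', 'C'), ('B', 'A'), ('B', 'C'), ('B', 'D'), ('D', 'C')]
    print(len(list(combinations("ABCD", 2))))   # 6 possible pairs in all

The one pair left unresolved is A versus D, matching the five-of-six count described above.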

Conducting MaxDiff Studies

MaxDiff studies may be fielded via paper-and-pencil or computerized questionnaires. Constructing these questionnaires requires determining which items appear in the different choice sets (an experimental design). MaxDiff experimental designs may be generated using our "Best/Worst Designer" software system. Analysis may be done using our Latent Class or CBC/HB software systems. (We hope to implement MaxDiff more integrally within a future version of SSI Web.)
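
The Best/Worst Designer automates design construction and also balances how often each pair of items appears together. Purely to illustrate the kind of one-way balance involved (this is not the algorithm used by our software), here is a naive Python sketch that deals items from repeatedly reshuffled lists so every item is shown about equally often:

    import random

    def simple_maxdiff_design(n_items, items_per_set, n_sets, seed=0):
        # Naive sketch: deal items from reshuffled "decks" so each item
        # appears about equally often across the choice sets. Production
        # designers also balance how often each PAIR of items co-appears.
        rng = random.Random(seed)
        items = list(range(n_items))
        pool, sets = [], []
        while len(sets) < n_sets:
            if len(pool) < items_per_set:
                reshuffled = items[:]
                rng.shuffle(reshuffled)
                # Skip items already waiting in the pool so no set shows
                # the same item twice.
                pool.extend(i for i in reshuffled if i not in pool)
            sets.append(sorted(pool[:items_per_set]))
            pool = pool[items_per_set:]
        return sets

    # Example: 20 items, 4 per set, 15 sets -> each item is shown 3 times.
    for task, shown in enumerate(simple_maxdiff_design(20, 4, 15), start=1):
        print(f"Set {task:2d}: items {shown}")

The sketch's three size parameters correspond directly to the design questions listed below.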

Common questions for MaxDiff studies include:

  1. How many items in total?
  2. How many items per set?
  3. How many sets per respondent?

We recently performed a simulation study to investigate these questions with respect to individual-level estimation under hierarchical Bayes (HB). A detailed write-up of this investigation can be found in the article entitled "Accuracy of HB Estimation in MaxDiff Experiments," also available in our Technical Papers Library on the Web.

Simulation Study Methodology and Results

We used computer-generated respondents following known utility rules (plus random error) to answer different MaxDiff questionnaires. The questionnaires varied in terms of the number of items overall, number of items per set, and sets offered per respondent. Then, we estimated parameters for each "respondent" under HB and observed how well the estimated parameters could predict holdout choice questions answered by the same "respondents."
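
Our article gives the full simulation details; the Python sketch below shows the general flavor of such an exercise under a common logit-style assumption (known utilities plus Gumbel-distributed error), with names and settings that are illustrative rather than those used in the study:

    import math
    import random

    def gumbel(rng):
        # Standard Gumbel draw -- the error term of a logit choice model.
        u = max(rng.random(), 1e-12)
        return -math.log(-math.log(u))

    def answer_set(true_utils, shown, rng):
        # One simulated best/worst answer: each shown item gets its true
        # utility plus random error; the "respondent" reports the highest-
        # and lowest-scoring items. (A simplification: the best and worst
        # picks share one noise draw per item here.)
        noisy = {i: true_utils[i] + gumbel(rng) for i in shown}
        return max(noisy, key=noisy.get), min(noisy, key=noisy.get)

    rng = random.Random(7)
    n_items, items_per_set, n_sets = 20, 4, 15
    true_utils = [rng.gauss(0.0, 1.0) for _ in range(n_items)]   # the known "truth"
    # Random sets for brevity; a real study would use a balanced design.
    questionnaire = [rng.sample(range(n_items), items_per_set) for _ in range(n_sets)]
    answers = [answer_set(true_utils, shown, rng) for shown in questionnaire]
    print(answers[:3])   # [(best item, worst item), ...] by item index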

Figure 1 summarizes the results across all questionnaires in our experimental design, in terms of relative hit rates.
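
A hit rate of this kind typically measures how often a respondent's estimated utilities reproduce that respondent's holdout best and worst picks (scoring conventions vary, and the rescaling into the relative measure plotted in Figure 1 is not shown here). A minimal scoring sketch in Python, again with illustrative names:

    def hit_rate(estimated_utils, holdout_tasks):
        # Each holdout task is (shown_items, observed_best, observed_worst).
        # A hit is scored when the item with the highest estimated utility
        # matches the observed best, and likewise for the lowest vs. worst.
        hits = trials = 0
        for shown, obs_best, obs_worst in holdout_tasks:
            pred_best = max(shown, key=lambda i: estimated_utils[i])
            pred_worst = min(shown, key=lambda i: estimated_utils[i])
            hits += int(pred_best == obs_best) + int(pred_worst == obs_worst)
            trials += 2
        return hits / trials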

Conclusions

Based on the results of our simulation study, we conclude:

  1. Over the ranges studied, the total number of items in the study has the greatest effect on the accuracy of results.
  2. The number of items displayed per task has the smallest effect.
  3. One gains very little by showing more than five items per task.
  4. Gains from increasing the number of tasks are nearly linear, at least up to 20 tasks.

These results should be interpreted with caution, since they are based on simulated data. We look forward to studies that use actual respondents to confirm our findings.