Sawtooth Software: The Survey Software of Choice

Using Our HB Software: Tips from the Trenches

Hierarchical Bayes (HB) estimation has been of immense value to our users. For us as software developers, it’s like a gift from our academic colleagues, perfectly suited to our already popular conjoint analysis programs. The majority of our CBC users report using HB for their final models. ACA users also employ HB estimation, as it provides a more theoretically sound way to combine self-explicated priors with conjoint pairs.

Sawtooth Software offers four packages for HB estimation:

  • CBC/HB (estimation for CBC and MaxDiff data)
  • ACA/HB (estimation for ACA data)
  • HB-Reg (estimation for general regression problems with a continuous dependent variable)
  • CVA/HB (estimation for traditional full-profile conjoint data)

We’ve now had a few years’ experience using these software systems in practice and providing technical support to users. Based on what we’ve observed, we share some tips from the trenches.

Run Classical Estimation for Reference

Because of its use of priors, HB produces estimates for parameters even if you were to use poor experimental designs (such as too many prohibitions, or highly correlated predictor variables) or incorrectly parameterized models (such as specifying an interaction between two attributes involved in a prohibition). Sometimes you’ll observe a clear lack of convergence of the parameters and recognize that something is terribly wrong. But often the HB solution will seem reasonable, and you’d be hard-pressed to know that there was a serious problem in your model.

With classical estimation methods like Ordinary Least Squares regression or Multinomial Logit, if there is a serious problem with your model, it will either “blow up” (e.g., a “cannot invert matrix” warning) or report relatively large standard errors for some of the parameters. Therefore, we suggest applying classical estimation techniques prior to running HB.
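As a rough illustration (not part of our HB software), here is a minimal Python sketch of the kind of pre-HB check we have in mind. It assumes you have already built a stacked design matrix X and a response vector y; the OLS fit, the condition-number cutoff, and the standard-error threshold are illustrative placeholders, not recommended defaults.

  # Sketch: quick classical diagnostics on a stacked design matrix before running HB.
  # Assumes X (tasks x parameters, effects-coded) and y (responses) are already built;
  # the variable names and thresholds below are illustrative only.
  import numpy as np

  def classical_diagnostics(X, y, se_warn=2.0):
      XtX = X.T @ X
      cond = np.linalg.cond(XtX)
      if cond > 1e8:
          print(f"Warning: near-singular design (condition number {cond:.2e})")
      beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
      dof = max(X.shape[0] - rank, 1)
      resid = y - X @ beta
      sigma2 = float(resid @ resid) / dof
      se = np.sqrt(np.diag(sigma2 * np.linalg.pinv(XtX)))
      for j, (b, s) in enumerate(zip(beta, se)):
          flag = "  <-- relatively large standard error" if s > se_warn else ""
          print(f"param {j}: beta = {b:+.3f}, se = {s:.3f}{flag}")
      return beta, se

  # Tiny synthetic example (replace with your own design and responses):
  rng = np.random.default_rng(0)
  X = rng.choice([-1.0, 0.0, 1.0], size=(60, 4))
  y = X @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(scale=0.5, size=60)
  classical_diagnostics(X, y)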

Sparse Data and Overfitting

HB provides estimates of part-worth utilities (or betas) at the individual level. It does so whether there is substantial or very little information from each respondent. This seemingly magical property stems from HB’s ability to leverage prior information and data from other respondents. It’s both a powerful and scary notion that HB can provide estimates for an individual even if that person has provided no data! A person with no information is in essence “imputed” as a normal “draw” from the population distribution. The ability of HB to produce estimates in spite of sparse data often leads users to build larger models (more estimated parameters) than the data justify.

Overfitting occurs when we add parameters to a model that are not truly useful for predicting new (out-of-sample) observations. Here are some common examples that can lead to overfitting:

  • Using HB to estimate part-worth utilities for a large number of attributes under partial-profile CBC, where relatively few choice tasks have been completed by each individual. (If using partial-profile CBC, we should increase the number of choice tasks relative to our usual practice with full-profile CBC.)
  • Using HB to estimate excessive interaction terms in CBC analysis. (They often are not justified, and frequently there isn’t enough information at the individual level to estimate them well.)
  • Estimating many parameters within HB-Reg using relatively few observations per respondent.

Each of these situations is characterized by estimating many parameters using relatively sparse data.

Holdout choice tasks can help you determine if you are overfitting (you may need at least three or four for good results). Compare the ability of HB to predict holdouts vs. simpler models such as aggregate logit (a naïve model that assumes each respondent conforms to the population average) or latent class (assuming each respondent is a weighted average of the class utilities, weighted according to probability of membership in each group). If you find that HB doesn’t predict as well as the aggregate solutions, this is clear evidence of overfitting. Try reducing the Prior Variance assumption (an advanced setting in our HB software) and re-estimating. This will “smooth” respondents more toward population estimates and in these extreme cases reduce the likelihood of overfitting.
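As an illustration of the holdout comparison (again, not part of our software), here is a minimal Python sketch that computes holdout hit rates for individual-level utilities versus a naïve aggregate benchmark. The array names, shapes, and synthetic data are purely illustrative; in practice you would substitute your exported utilities and holdout tasks.

  # Sketch: holdout hit rates for individual-level utilities vs. an aggregate benchmark.
  import numpy as np

  def hit_rate(utils, holdout_X, holdout_choice):
      """utils: (n_resp, n_params); holdout_X: (n_resp, n_tasks, n_alts, n_params);
      holdout_choice: (n_resp, n_tasks) index of the chosen alternative."""
      alt_utils = np.einsum('rp,rtap->rta', utils, holdout_X)   # utility of each alternative
      predicted = alt_utils.argmax(axis=2)                      # predicted choice per task
      return (predicted == holdout_choice).mean()

  # Toy data; because the choices are generated from the same utilities,
  # the individual-level hit rate will be near 1.0 here.
  rng = np.random.default_rng(1)
  n_resp, n_tasks, n_alts, n_params = 200, 4, 3, 6
  hb_utils = rng.normal(size=(n_resp, n_params))
  holdout_X = rng.choice([-1.0, 0.0, 1.0], size=(n_resp, n_tasks, n_alts, n_params))
  holdout_choice = np.einsum('rp,rtap->rta', hb_utils, holdout_X).argmax(axis=2)

  # Naive aggregate benchmark: every respondent gets the mean utility vector
  agg_utils = np.tile(hb_utils.mean(axis=0), (n_resp, 1))
  print("Individual-level hit rate:", hit_rate(hb_utils, holdout_X, holdout_choice))
  print("Aggregate hit rate:       ", hit_rate(agg_utils, holdout_X, holdout_choice))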

Some researchers don’t feel they have the luxury of using holdouts, because in the end they are “wasted”—not used in the final utility estimation delivered to clients. This doesn’t have to be the case. Once the holdout choice tasks have served their purpose for helping you build your final model, you can include them (together with the experimentally designed tasks) in your final utility estimation. Statistical purists may frown on this, but for practical purposes, it is efficient utilization of the data collection budget.

Do You Really Need that Interaction Effect?

We often find that interaction effects appearing to be significant under aggregate analysis (Counts or Logit) are actually due to unrecognized heterogeneity. For example, the people who tend to prefer premium brands also tend to be less price sensitive overall, and the people who like sports cars also tend to prefer bright colors in automobiles. A market simulator based on individual-level estimation and main effects only often does quite well, through market simulations and sensitivity analysis, at accounting for effects that appeared in the aggregate to be interactions. For more information, see “Predicting Actual Sales with CBC: How Capturing Heterogeneity Improves Results” in our Technical Papers library at www.sawtoothsoftware.com.
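To make the idea concrete, here is a toy Python sketch (all numbers illustrative) of a main-effects logit simulator over two heterogeneous segments. No interaction term is estimated, yet the premium brand’s aggregate response to price reflects the fact that its fans are also the less price-sensitive respondents.

  # Sketch: heterogeneity in main effects can look like a brand-by-price interaction
  # in the aggregate; a main-effects simulator over segments captures it anyway.
  import numpy as np

  # Toy main-effect utilities per segment: [premium_brand_utility, price_slope_per_dollar]
  segments = {
      "A (premium lovers, less price sensitive)": np.array([1.5, -0.5]),
      "B (value seekers, more price sensitive)":  np.array([-0.5, -2.0]),
  }

  def premium_share(utils, premium_price, standard_price=1.0):
      """Logit share of a premium brand vs. a standard brand, main effects only."""
      u_premium = utils[0] + utils[1] * premium_price
      u_standard = 0.0 + utils[1] * standard_price
      return np.exp(u_premium) / (np.exp(u_premium) + np.exp(u_standard))

  for price in (1.0, 1.5, 2.0):
      shares = {name: premium_share(u, price) for name, u in segments.items()}
      avg = np.mean(list(shares.values()))
      detail = ", ".join(f"{name}: {s:.2f}" for name, s in shares.items())
      print(f"premium priced at {price:.1f} -> aggregate share {avg:.2f} ({detail})")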

Truly Huge Sample Sizes? Divide and Conquer.

Some users are fortunate enough to have truly huge sample sizes, such as greater than 5,000. This happens from time to time, especially with employee research. In years past (slower computers and less optimized code), we would worry about astronomical run times. However, both the hardware and our code have improved vastly, and a typical HB run for 10,000 respondents today may take around six to eight hours. True, that’s significant computational effort, but it can easily be done overnight.

A key assumption in our implementation of HB is that people are drawn from a normal distribution. However, if there are different groups of respondents that show substantial differences in part-worths, then it should theoretically be better to run HB within each group rather than combining respondents in a single HB run. There is an oft-cited paper from the Sawtooth Software conference (Sentis and Lee, 2001) showing that this approach didn’t improve holdout predictions for seven commercial data sets, ranging from n=280 to n=800. But that’s probably because the relatively small sample size per segment for these datasets led to weaker upper-level estimates (population estimates of means and covariances) that counteracted the potential benefit of segmenting the population.

If you have truly huge sample sizes, you may do better to break the sample into different groups using a segmentation variable that interacts with preferences and that results in subgroups substantial enough on their own to support stable population estimates (e.g., n=1,000 or more). Perhaps better yet, a latent class solution may be used to develop the subgroups. After segmenting, you could set up multiple HB runs, spreading the computational burden across more than one machine and thereby slicing the total estimation time. With a sample size of 1,000 or more per meaningfully different segment, it is likely that you will see modest improvement rather than the slight degradation that Sentis and Lee observed after running HB within much smaller segments.
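As a sketch of the mechanics (assuming your choice data sit in a flat file with a respondent ID and a segment assignment; the file and column names below are hypothetical), splitting the sample into per-segment files for separate HB runs might look like this in Python:

  # Sketch: split a large respondent file by segment so each subgroup can be
  # estimated in its own HB run, possibly on separate machines.
  import pandas as pd

  data = pd.read_csv("cbc_choice_data.csv")   # hypothetical file with a "segment" column

  for segment, group in data.groupby("segment"):
      n_resp = group["respondent_id"].nunique()
      if n_resp < 1000:
          print(f"Segment {segment}: only {n_resp} respondents -- consider merging segments")
      out_file = f"cbc_choice_data_segment_{segment}.csv"
      group.to_csv(out_file, index=False)
      print(f"Wrote {out_file} ({n_resp} respondents) for a separate HB run")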

Scale Your Linear Terms Properly

In CBC/HB, you can optionally choose to estimate a single linear parameter to represent a quantitative attribute like price (we generally don’t advocate this, but it might be helpful under certain conditions). The most common mistake is improper choice of level values (representing the actual prices) to be coded in the design matrix. Our implementation of HB assumes that each parameter has equal variance and a prior mean of zero. If you scale your price values in the X matrix in thousands of dollars, then the resulting utility coefficient will be very small (with much smaller variance than the priors had assumed). As a result, you may see a lack of convergence and biased estimates for the price coefficient relative to the other parameters.

We’d recommend that you scale the values placed in the X matrix to have a range of about 2 (reflecting the same range as the effects-coded values for the other attributes, which take on values of +1, 0, or -1). The range doesn’t need to be exactly 2, but that’s an appropriate target. A good option is to take the natural log of prices first, thus fitting log price. This tends to scale the price values used in the X matrix to a more appropriate range requiring little or no additional rescaling.
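For example, here is a small Python sketch (with illustrative price points) showing both approaches: rescaling centered prices to a range of about 2, and taking the natural log of price.

  # Sketch: coding a linear price term so its range in the X matrix is about 2,
  # in line with the +1/0/-1 effects coding of the other attributes. Prices are illustrative.
  import numpy as np

  prices = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])

  # Option 1: center the raw prices and rescale them to a range of 2
  centered = prices - prices.mean()
  linear_codes = 2.0 * centered / (prices.max() - prices.min())

  # Option 2: fit log price instead; the log transform alone often lands in a
  # reasonable range, so little or no extra rescaling is needed
  log_codes = np.log(prices) - np.log(prices).mean()

  print("raw price range: ", prices.max() - prices.min())
  print("linear codes:    ", np.round(linear_codes, 3),
        " range:", round(float(linear_codes.max() - linear_codes.min()), 3))
  print("log-price codes: ", np.round(log_codes, 3),
        " range:", round(float(log_codes.max() - log_codes.min()), 3))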

This same caution especially holds for HB-Reg. We’ve had users show us HB-Reg runs that never seem to converge (with parameters wandering substantially rather than oscillating around the true mean without noticeable trend). We usually find that the X values have vastly different scaling. A useful step is to standardize the variables in your X matrix prior to running HB-Reg.

And don’t forget to explicitly include an additional term for the intercept in the design matrix, if needed. Omitting it often explains a lack of convergence.
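A minimal Python sketch of that preparation step, assuming your predictors are already collected in an array X (the names and toy values are illustrative):

  # Sketch: standardize predictor columns and add an explicit intercept column
  # before exporting an X matrix for HB-Reg.
  import numpy as np

  def prepare_design(X):
      """Z-score each column, then prepend a column of ones for the intercept."""
      means = X.mean(axis=0)
      stds = X.std(axis=0)
      stds[stds == 0] = 1.0               # guard against constant columns
      Z = (X - means) / stds
      return np.column_stack([np.ones(X.shape[0]), Z])

  # Toy example with wildly different scales (e.g., price in dollars vs. a 1-10 rating)
  rng = np.random.default_rng(2)
  X = np.column_stack([rng.uniform(10000, 50000, size=8), rng.integers(1, 11, size=8)])
  print(prepare_design(X).round(3))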

Do You Need the Self-Explicated Importances in ACA?

The latest version of ACA/HB (v3) allows the researcher to ignore the information provided by self-explicated importances. We’ve posted two papers on our website that question the reliability of self-explicated importances for ACA. (See “The ‘Importance’ Question in ACA: Can It Be Omitted?” and “Perspectives on the Recent Debate over Conjoint Analysis and Modeling Preferences with ACA” in our Technical Papers library at www.sawtoothsoftware.com.)

If you’ve collected self-explicated importances, we’d suggest running ACA/HB with and without the importance information included (under the Advanced Estimation Settings tab, uncheck Use respondent importances as constraints). Compare the average attribute importances before and after including the importances as constraints (or, better yet, compare the ability of the part-worths to predict holdout choice tasks). If the inclusion of self-explicated importances strongly affects the final derived importances compared to estimation using the conjoint portion of the ACA interview only, this is evidence that the self-explicated information is biased relative to the tradeoff information. If including the self-explicated importances in HB estimation degrades the fit to holdouts, that’s even more compelling confirmation of a problem. Of course, omitting the self-explicated importances discards information, so you should be careful about doing this if individual-level classification accuracy is the goal or when dealing with especially small sample sizes.
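Here is an illustrative Python sketch of the importance comparison, assuming the part-worths from the two ACA/HB runs have been exported to CSV files and that you know which columns belong to each attribute (the file names and attribute map below are hypothetical):

  # Sketch: compare average attribute importances from two ACA/HB runs,
  # one estimated with self-explicated importances as constraints and one without.
  import pandas as pd

  # Columns belonging to each attribute in the exported utility files (illustrative)
  attribute_columns = {
      "Brand":   ["brand_a", "brand_b", "brand_c"],
      "Price":   ["price_low", "price_mid", "price_high"],
      "Service": ["svc_basic", "svc_premium"],
  }

  def average_importances(utilities: pd.DataFrame) -> pd.Series:
      """Importance = each attribute's utility range, normalized to sum to 100 per respondent."""
      ranges = pd.DataFrame({
          attr: utilities[cols].max(axis=1) - utilities[cols].min(axis=1)
          for attr, cols in attribute_columns.items()
      })
      importances = ranges.div(ranges.sum(axis=1), axis=0) * 100.0
      return importances.mean()

  with_constraints = pd.read_csv("acahb_with_importances.csv")       # hypothetical export
  without_constraints = pd.read_csv("acahb_without_importances.csv") # hypothetical export
  print(pd.DataFrame({
      "with importances": average_importances(with_constraints),
      "without importances": average_importances(without_constraints),
  }).round(1))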

Using CBC/HB for MaxDiff? Don’t Forget to Apply the Prior Covariance Matrix (.mtrx)!

When using MaxDiff (best-worst scaling), both the MaxDiff Experiment Designer and the MaxDiff/Web software export a .CHO file and associated files for use with CBC/HB. One of those files is the prior covariance matrix file, which carries the .mtrx extension.

If the number of times each item appears in the questionnaire for each respondent is relatively low (especially below 3x), then the mean and variance of the last “omitted” item can be biased with respect to the other parameters. To avoid this, under the Advanced Estimation Settings dialog, make sure to check the Use custom prior covariance matrix box. (See Appendix G of the CBC/HB v4 documentation for more information.)