Testing the CBC Design

Introduction

In CBC/Web, a design refers to the sum total of the task descriptions across all respondents. The design contains information about the combinations of attribute levels that make up the product concepts within the tasks. The design is saved to a design file that you upload to your web server. Optimally efficient CBC designs estimate all part worths with optimal precision, meaning that the standard errors of the estimates are as small as possible given the total observations (respondents x tasks), the number of product concepts displayed per task, and respondent preferences.

CBC/Web's random design strategies generally result in very efficient designs. These designs are not optimally efficient, but are nearly so. In the case of large sample sizes, a large number of questionnaire versions in the design file, and no prohibitions, one can confidently field a questionnaire without testing the design.

However, there are conditions that can result in inefficient designs. Sometimes, a design can be so inefficient as to defy all attempts to compute reasonable part worth utilities. We have heard of entire data sets with hundreds of respondents going to waste because the user neglected to test the design.

Therefore, it is imperative to test your design whenever any of the following conditions exist:

·any prohibitions are included  
·sample size (respondents x tasks) is abnormally small  
·the number of versions you plan to use is few  

CBC/Web's default Test Design capability (OLS Efficiency) only tests the efficiency of main effects (the separate utility estimate for each attribute level). It provides a good approximation of the relative efficiency of the CBC design with respect to each attribute level. Many researchers using standard CBC plans find this default test adequate for their purposes. For more sophisticated design testing, an Advanced Test option is available for simulating "dummy" respondent answers and reporting the standard errors (from a logit run) along with D-efficiency.

Our design testing methods assume aggregate analysis, though most CBC/Web users eventually employ individual-level estimation via CBC/HB. That said, CBC/Web's design strategies can produce designs that are efficient at both the aggregate and individual levels (though we don't specifically provide valid measures of the standard errors for individual respondent estimates within our test design procedures).

Prohibitions are often the culprit when it comes to unacceptable design efficiency. If your prohibitions result in unacceptably low design efficiency under the Complete Enumeration or Balanced Overlap Methods, you should try the Shortcut or Random design strategies. These latter two methods are less constrained than the more rigorous former ones, and will sometimes result in higher design efficiencies in the case of many prohibitions.



Testing the Efficiency of Your Design


When you choose Test Design from the Specify CBC Interview Parameters dialog, CBC/Web automatically tests the design and displays the results within the results window (the results are also saved to a file named STUDYNAMEtest.txt). CBC/Web automatically generates a data file of simulated (dummy) respondent answers appropriate for advanced design testing.

CBC/Web includes two test procedures:

·Test Design (Frequencies and OLS Efficiency)  
(the default test routine used in previous versions of CBC/Web, based on OLS theory)  
 
·Advanced Test (Simulated Data, Logit Report, and D-Efficiency)  
(a more rigorous test, based on conditional logit theory)  



Test Design (Frequencies and OLS Efficiency)

Following is a sample report, as it appears in the results window:

         CBC Design Efficiency Test
      Copyright Sawtooth Software

      Task Generation Method is 'Complete Enumeration' using a seed of 1
      Based on 10 version(s).
      Includes 100 total choice tasks (10 per version).
      Each choice task includes 3 concepts and 6 attributes.

       A Priori Estimates of Standard Errors for Attribute Levels
       ---------------------------------------------------------------
        Att/Lev  Freq.  Actual        Ideal        Effic
          1 1      75  (this level has been deleted)         Brand A
          1 2      75   0.2890       0.2887        0.9981    Brand B
          1 3      75   0.2841       0.2835        0.9958    Brand C
          1 4      75   0.2936       0.3062        1.0873    Brand D

          2 1     100   (this level deleted)                 1.5 GHz
          2 2     100   0.2207       0.2182        0.9776    2.0 GHz
          2 3     100   0.2275       0.2182        0.9200    2.5 GHz

          3 1     100   (this level deleted)                 3 lbs
          3 2     100   0.2297       0.2182        0.9022    5 lbs
          3 3     100   0.2235       0.2182        0.9533    8 lbs

          4 1     100   (this level deleted)                 60 GB Hard Drive
          4 2     100   0.2234       0.2182        0.9543    80 GB Hard Drive
          4 3     100   0.2204       0.2182        0.9806    120 GB Hard Drive

          5 1     100   (this level deleted)                 512 MB RAM 
          5 2     100   0.2199       0.2182        0.9850    1 GB RAM
          5 3     100   0.2203       0.2182        0.9809    2 GB RAM

          6 1     100   (this level deleted)                 $500
          6 2     100   0.2237       0.2182        0.9516    $750
          6 3     100   0.2222       0.2182        0.9648    $1,000


For each level, the number of times it occurs within the design is counted and provided under the column titled "Freq." Optimally efficient designs show levels within each attribute an equal number of times.

For each attribute and level, an approximation is made of the relative standard error of each main effect under aggregate analysis, assuming that each version is seen just once across the total observations. Test Design uses ordinary least squares (OLS) rather than multinomial logit for this purpose, and it uses only the information about the design of the choice tasks, not respondents' answers. (A multinomial logit model is used in CBC's analysis modules.) This test design method gives relative standard error estimates similar to (but not identical to) those of multinomial logit. With this test, the emphasis is not on a precise estimate of each standard error for a given number of respondents, but rather on the pattern of their relative magnitudes with respect to one another.

The Sample Output

We'll describe the output using fragments of the report, each repeating the lines relevant to the part being described.

Each line is labeled with the attribute and level in the first columns:

        Att/Lev  Freq.  Actual        Ideal        Effic
          1 1      75  (this level has been deleted)         Brand A
          1 2      75   0.2890       0.2887        0.9981    Brand B

The number of times each level occurs in the design is displayed under the column labeled "Freq."

        Att/Lev  Freq.  Actual        Ideal        Effic
          1 1      75  (this level has been deleted)         Brand A
          1 2      75   0.2890       0.2887        0.9981    Brand B

For estimation, it is necessary to omit one level from each attribute. The first level of each attribute is automatically deleted from this analysis:

        Att/Lev  Freq.  Actual        Ideal        Effic
          1 1      75  (this level has been deleted)         Brand A
          1 2      75   0.2890       0.2887        0.9981    Brand B

The column labeled "Actual" gives estimated standard errors for the data file analyzed:

        Att/Lev  Freq.  Actual        Ideal        Effic
          1 1      75  (this level has been deleted)         Brand A
          1 2      75   0.2890       0.2887        0.9981    Brand B
          1 3      75   0.2841       0.2835        0.9958    Brand C
          1 4      75   0.2936       0.3062        1.0873    Brand D

The column labeled "Ideal" gives an estimate of what those standard errors would be if the design were precisely orthogonal and had the same number of observations:

        Att/Lev  Freq.  Actual        Ideal        Effic
          1 1      75  (this level has been deleted)         Brand A
          1 2      75   0.2890       0.2887        0.9981    Brand B
          1 3      75   0.2841       0.2835        0.9958    Brand C
          1 4      75   0.2936       0.3062        1.0873    Brand D

The column labeled "Effic" gives the relative efficiency of this design in terms of estimating each parameter, compared to the hypothetical orthogonal design (it is the square of their ratio):

        Att/Lev  Freq.  Actual        Ideal        Effic
          1 1      75  (this level has been deleted)         Brand A
          1 2      75   0.2890       0.2887        0.9981    Brand B
          1 3      75   0.2841       0.2835        0.9958    Brand C
          1 4      75   0.2936       0.3062        1.0873    Brand D

When we consider the entire "Effic" column, we see that the randomized design had a median efficiency of about 97 percent, relative to a hypothetical orthogonal design. The estimates of standard errors for orthogonal designs are only approximate, and with a very small data file such as this there can be quite a lot of variability in estimation.
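As a rough check of the "Effic" column, the sketch below recomputes each efficiency as the squared ratio of the Ideal to the Actual standard error, using the values printed in the sample report. The names `REPORT` and `relative_efficiency` are illustrative only and are not part of CBC/Web. Because the report prints standard errors to four decimals, the recomputed figures can differ from the printed ones in the last digits.

```python
from statistics import median

# (actual, ideal) standard errors copied from the sample report, keyed by
# (attribute, level). First levels are omitted, as in the report.
REPORT = {
    (1, 2): (0.2890, 0.2887), (1, 3): (0.2841, 0.2835), (1, 4): (0.2936, 0.3062),
    (2, 2): (0.2207, 0.2182), (2, 3): (0.2275, 0.2182),
    (3, 2): (0.2297, 0.2182), (3, 3): (0.2235, 0.2182),
    (4, 2): (0.2234, 0.2182), (4, 3): (0.2204, 0.2182),
    (5, 2): (0.2199, 0.2182), (5, 3): (0.2203, 0.2182),
    (6, 2): (0.2237, 0.2182), (6, 3): (0.2222, 0.2182),
}


def relative_efficiency(actual, ideal):
    """Squared ratio of the ideal to the actual standard error."""
    return (ideal / actual) ** 2


effic = {key: relative_efficiency(a, i) for key, (a, i) in REPORT.items()}
median_effic = median(effic.values())

for key in sorted(effic):
    print(key, round(effic[key], 4))
print("median efficiency:", round(median_effic, 3))
```

An efficiency above 1.0, as for attribute 1, level 4, simply means the Actual standard error came out below the approximate Ideal value.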

Notice that the standard error estimated for attribute 1, level 4 is actually smaller than the value estimated for a hypothetical orthogonal design:

        Att/Lev  Freq.  Actual        Ideal        Effic
          1 1      75  (this level has been deleted)         Brand A
          1 2      75   0.2890       0.2887        0.9981    Brand B
          1 3      75   0.2841       0.2835        0.9958    Brand C
          1 4      75   0.2936       0.3062        1.0873    Brand D

Anomalies such as this are likely to occur when using small samples of test respondents, and shouldn't be of concern.

It is important to test the design because too many prohibitions can produce a design that does not permit estimation of the desired effects. When that occurs, the estimated standard errors for those effects are infinite and their estimated efficiencies are zero. The report flags such cases with asterisks in place of numbers, with error messages, or both. If you see asterisks listed for the standard errors or a warning stating that your design is deficient, reconsider which combinations you prohibit before fielding the study.

The design seed affects the quality of the design, especially when you use few questionnaire versions. You may want to try different design seeds to obtain a slightly better design.



Advanced Test

Rather than just offering a relative measure of efficiency, the Advanced Test estimates the absolute precision of the parameter estimates under aggregate estimation, based on the combined elements of design efficiency and sample size (respondents x tasks). The estimated standard errors are only absolutely correct if the assumptions regarding the underlying part worths and the error in responses are correct. The Advanced Test is useful for both standard and complex designs that include interaction or alternative-specific effects. It also reports a widely accepted measure of design efficiency called D-efficiency, which summarizes the overall relative precision of the design.

Unlike the simpler default test, the Advanced Test takes into account that design efficiency for CBC studies depends on how the concepts are grouped in sets. The level contrasts within sets determine how much information that set contains with respect to the levels of interest. Technically, the utility balance among the concepts within those sets also affects overall design efficiency, and thus respondents' preferences need to be known to assess the efficiency of a design. However, most researchers are more comfortable planning designs that are efficient with respect to uninformative (zero) part worth utility values, and that is the approach we take.

The Advanced Test simulates random (dummy) respondent answers for your questionnaire, for as many respondents as you plan to interview. The test is run with respect to a given model specification (main effects plus optional first-order interactions that you specify).

To perform the Advanced Test, you need to supply some information:

·Number of Respondents  
·% None (if applicable to your questionnaire)  
·Included Interaction Effects (if any)  

With this information, CBC/Web simulates random respondent answers to your questionnaire. Using random respondent answers is considered a robust approach, because it estimates the efficiency of the design for respondents with heterogeneous and unknown preferences.

Simulated respondents are systematically assigned to the versions of your questionnaire (the first respondent receives the first version, the second respondent the second version, etc.). If you are simulating more respondents than versions of the questionnaire, once all versions have been assigned, the next respondent starts again with the first version. If your study includes a None alternative, then the None is selected with expected probability equal to the value you previously specified.
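The assignment and answering rules just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CBC/Web's actual routine: the function names are hypothetical, the None alternative is coded as category `n_concepts + 1`, and non-None choices are drawn uniformly across the concepts.

```python
import random


def assign_version(respondent_index, n_versions):
    """Round-robin: respondent 0 gets version 1, wrapping after the last version."""
    return respondent_index % n_versions + 1


def simulate_choice(rng, n_concepts, none_prob):
    """Pick None (coded n_concepts + 1) with probability none_prob, else a random concept."""
    if none_prob > 0 and rng.random() < none_prob:
        return n_concepts + 1
    return rng.randint(1, n_concepts)  # inclusive on both ends


def simulate_study(n_respondents, n_versions, n_tasks, n_concepts, none_prob, seed=1):
    """Return one (version, answers) pair per simulated respondent."""
    rng = random.Random(seed)
    data = []
    for r in range(n_respondents):
        version = assign_version(r, n_versions)
        answers = [simulate_choice(rng, n_concepts, none_prob) for _ in range(n_tasks)]
        data.append((version, answers))
    return data


# Assumed study size for illustration: 300 respondents, 10 versions,
# 12 tasks of 4 concepts each, and a 15% None rate.
data = simulate_study(300, 10, 12, 4, 0.15)
none_share = sum(a == 5 for _, answers in data for a in answers) / (300 * 12)
print(f"Simulated None share: {none_share:.4f}")
```

The simulated None share varies around the requested probability from seed to seed, which is why a report based on a few thousand choices can land a point or so away from the specified value.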

Once the data set has been simulated, the Advanced Test performs an aggregate logit (MNL) run, estimating the effects you selected (by default, only main effects are considered). Sample results are shown below:

Logit Report with Simulated Data  
------------------------------------------------------------  
Main Effects: 1, 2, 3, 4, 5, 6, 7  
Interactions: 1x6  
 
Build includes 300 respondents.  
 
Total number of choices in each response category:  
 
      Category Number Percent  
----------------------------------------------------  
             1    787  21.86%  
             2    753  20.92%  
             3    778  21.61%  
             4    792  22.00%  
             5    490  13.61%  
 
There are 3600 expanded tasks in total, or an average of 12.0 tasks per respondent.  
 
                       Aggregate  
         Effect        Std Err       t Ratio      Attribute Level  
  1        0.01171        0.03186        0.36757    1 1 Brand A  
  2        0.01427        0.03182        0.44850    1 2 Brand B  
  3       -0.00202        0.03195       -0.06333    1 3 Brand C  
  4       -0.02396        0.03215       -0.74517    1 4 Brand D  
 
  5        0.01118        0.02638        0.42372    2 1 1.5 GHz  
  6        0.00600        0.02639        0.22715    2 2 2.0 GHz  
  7       -0.01717        0.02654       -0.64704    2 3 2.5 GHz  
 
  8       -0.00534        0.02638       -0.20235    3 1 3 lbs  
  9        0.01096        0.02631        0.41654    3 2 5 lbs  
 10       -0.00562        0.02645       -0.21251    3 3 8 lbs  
 
 11       -0.02986        0.02655       -1.12501    4 1 60 GB Hard Drive  
 12        0.04165        0.02620        1.58962    4 2 80 GB Hard Drive  
 13       -0.01179        0.02644       -0.44580    4 3 120 GB Hard Drive  
 
 14       -0.00618        0.02644       -0.23387    5 1 512 MB RAM  
 15       -0.03820        0.02661       -1.43548    5 2 1 GB RAM  
 16        0.04438        0.02615        1.69683    5 3 2 GB RAM  
 
 17        0.05398        0.02610        2.06837    6 1 $500  
 18       -0.01099        0.02647       -0.41497    6 2 $750  
 19       -0.04300        0.02669       -1.61105    6 3 $1,000  
 
 20        0.09273        0.04975        1.86389    Brand A by $500  
 21       -0.01982        0.05101       -0.38851    Brand A by $750  
 22       -0.07291        0.05124       -1.42294    Brand A by $1,000  
 23       -0.11173        0.05083       -2.19797    Brand B by $500  
 24        0.03271        0.05040        0.64897    Brand B by $750  
 25        0.07902        0.05060        1.56170    Brand B by $1,000  
 26       -0.02233        0.05050       -0.44214    Brand C by $500  
 27       -0.00902        0.05104       -0.17680    Brand C by $750  
 28        0.03135        0.05094        0.61548    Brand C by $1,000  
 29        0.04132        0.05038        0.82021    Brand D by $500  
 30       -0.00386        0.05102       -0.07574    Brand D by $750  
 31       -0.03746        0.05164       -0.72529    Brand D by $1,000  
 
 32       -0.45916        0.04862       -9.44349    NONE  
 
The strength of design for this model is: 3,256.006  
(The ratio of strengths of design for two designs reflects the D-Efficiency of one design relative to the other.)  

Details regarding the logit report may be found in the section entitled CBC Analysis: Counts and Logit.

The beginning of the report lists the effects we are estimating (main effects for attributes 1 through 6, plus the interaction effect between attributes 1 and 6). All random tasks are included in estimation.

Next, we see that 300 respondents each with 12 tasks were simulated using random responses, taking into account the expected probability for None. We specified that the None would be chosen with 15% likelihood, and indeed the None percentage is very close to that (13.61%). If we had used more respondents, the probability would have been even closer to 15%. The remaining choices are spread approximately evenly across the four other alternatives in the questionnaire.

Next, the logit report based on the random responses is shown. You shouldn't pay any attention to the effects (utilities) in the first column, as we are using random data. The T-ratios are also not useful, for the same reason. The important column to study is the Aggregate Std Err (Standard Error) column. The standard errors reflect the precision we obtain for each parameter. Lower error means greater precision. This design included no prohibitions, so the standard errors are quite uniform within each attribute. If we had included prohibitions, some levels might have been estimated with much lower precision than others within the same attribute.

For our simulated data above, the levels within three-level attributes all have standard errors around 0.026. The one four-level attribute has standard errors for its levels around 0.032. We have obtained less precision for the four-level attribute, since each of its levels appears fewer times in the design than the levels of the three-level attributes. The interaction effects have standard errors around 0.051.

Suggested guidelines are:

·Standard errors within each attribute should be roughly equivalent  
·Standard errors for main effects should be no larger than about 0.05  
·Standard errors for interaction effects should be no larger than about 0.10  

The last two criteria are rules of thumb based on our experience with many different data sets and our opinions regarding minimum sample sizes and minimum acceptable precision. Ideally, we prefer standard errors from this test of less than 0.025 for main effects and less than 0.05 for interaction effects. These simulated data (300 respondents with 12 tasks each) almost meet that higher standard for this particular attribute list and set of effects.
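The guidelines above are easy to apply mechanically. The sketch below checks the standard errors from the sample logit report against both the minimum and the preferred thresholds. The function name `check_guidelines` and the `spread_limit` of 1.10 (largest standard error no more than 10% above the smallest within an attribute) are assumptions made for illustration; the text only asks that standard errors be "roughly equivalent".

```python
# Standard errors copied from the sample logit report, grouped by attribute.
main_effects = {
    1: [0.03186, 0.03182, 0.03195, 0.03215],
    2: [0.02638, 0.02639, 0.02654],
    3: [0.02638, 0.02631, 0.02645],
    4: [0.02655, 0.02620, 0.02644],
    5: [0.02644, 0.02661, 0.02615],
    6: [0.02610, 0.02647, 0.02669],
}
interactions = [0.04975, 0.05101, 0.05124, 0.05083, 0.05040, 0.05060,
                0.05050, 0.05104, 0.05094, 0.05038, 0.05102, 0.05164]


def check_guidelines(main_effects, interactions, main_limit, inter_limit,
                     spread_limit=1.10):
    """Compare the worst-case standard errors against the given thresholds."""
    worst_main = max(se for ses in main_effects.values() for se in ses)
    worst_inter = max(interactions) if interactions else 0.0
    # Ratio of largest to smallest standard error within each attribute.
    worst_spread = max(max(ses) / min(ses) for ses in main_effects.values())
    return {
        "uniform_within_attribute": worst_spread <= spread_limit,
        "main_effects_ok": worst_main <= main_limit,
        "interactions_ok": worst_inter <= inter_limit,
    }


minimum = check_guidelines(main_effects, interactions, 0.05, 0.10)
preferred = check_guidelines(main_effects, interactions, 0.025, 0.05)
print("minimum guidelines:", minimum)
print("preferred guidelines:", preferred)
```

For these data the minimum guidelines are all satisfied, while the preferred thresholds are narrowly missed for both main effects and interactions.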

D-Efficiency

D-efficiency summarizes how precisely this design can estimate all the parameters of interest with respect to another design, rather than how well the design can estimate the utility of each level of each attribute (as with the simpler default test). D-efficiency is described in an article by Kuhfeld, Tobias, and Garratt (1994), "Efficient Experimental Design with Marketing Research Applications," Journal of Marketing Research, 31 (November), 545-557.

To arrive at D-efficiency, we should define a few terms:

$X_t$ = design matrix for task $t$, with a row for each alternative  
$x_i$ = $i$th row of $X_t$  
$p_i$ = probability of choice of alternative $i$  
$v$ = probability-weighted mean of the rows: $v = \sum_i p_i x_i$  
$Z_t$ = matrix with $i$th row $z_i = p_i^{1/2} (x_i - v)$  
$Z$ = matrix made by appending all $Z_t$ matrices  

Z'Z is known as the "Information Matrix"  
The determinant of Z'Z measures the strength of the design.  

Because the magnitude of the determinant of Z'Z depends on the number of parameters estimated, to provide a measure of strength independent of p we consider the pth root of the determinant:

   $|Z'Z|^{1/p}$

Where Z is the probability-centered design matrix, Z'Z is the "Information Matrix," and p is the number of parameters estimated.
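To make the definitions concrete, here is a minimal pure-Python sketch that builds Z'Z task by task and takes the pth root of its determinant. It assumes zero part worths, so every alternative in a task has equal choice probability, and it assumes the rows are already coded. The function names and the tiny two-parameter design are hypothetical, and the sketch does not reproduce CBC/Web's handling of the None parameter.

```python
def information_matrix(tasks):
    """Accumulate Z'Z over tasks; each task is a list of coded rows x_i."""
    p = len(tasks[0][0])
    info = [[0.0] * p for _ in range(p)]
    for rows in tasks:
        m = len(rows)
        probs = [1.0 / m] * m  # zero part worths imply equal choice shares
        v = [sum(probs[i] * rows[i][k] for i in range(m)) for k in range(p)]
        for i in range(m):
            d = [rows[i][k] - v[k] for k in range(p)]
            for a in range(p):
                for b in range(p):
                    # z_i' z_i with z_i = sqrt(p_i) * (x_i - v)
                    info[a][b] += probs[i] * d[a] * d[b]
    return info


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular: some effect cannot be estimated
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def design_strength(tasks):
    """pth root of |Z'Z|, or 0.0 when the information matrix is singular."""
    info = information_matrix(tasks)
    det = determinant(info)
    return det ** (1.0 / len(info)) if det > 0 else 0.0


# Two tasks, two alternatives each, two coded parameters (illustrative only).
tasks = [
    [[1, 0], [0, 1]],
    [[1, 1], [0, 0]],
]
strength = design_strength(tasks)
print("strength of design:", strength)
```

With only the first task, the two parameters are perfectly confounded and the strength is zero; the second task supplies the contrast needed to separate them.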

The pth root of the determinant is not a single value bounded by 0 and 1.0 (as in the simpler test's efficiency report), and it is meaningless without reference to the same value computed for a comparison design. This value also depends on the number of respondents x tasks, so when comparing two designs, it is important to hold the number of respondents x tasks constant. We use the term "efficiency" to compare the relative strengths of two designs. The relative D-efficiency of one design with respect to the other is given by the ratio of the pth roots of the determinants of their information matrices. The design with the larger value is the more efficient design. (Note: we only consider the precision of parameters other than the "None" parameter when computing the strength of the design.)

Consider design A with no prohibitions and design A' with prohibitions. The pth root of the determinant of the information matrix is computed for both (holding the number of respondents, tasks, concepts per task, and None % constant). If design A' has a value of 2,500 and design A has a value of 3,000, design A' is 2,500/3,000 = 83.3% as efficient as design A. The inclusion of prohibitions resulted in a 1 - 0.833 = 16.7% loss in efficiency.
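The same comparison can be written as a small helper. The function name is illustrative, and the two strength values are the hypothetical ones from the example above.

```python
def relative_d_efficiency(strength_test, strength_reference):
    """Ratio of two strength-of-design values computed under identical settings."""
    if strength_reference <= 0:
        raise ValueError("reference strength must be positive")
    return strength_test / strength_reference


efficiency = relative_d_efficiency(2500.0, 3000.0)  # design A' relative to design A
loss = 1.0 - efficiency
print(f"{efficiency:.1%} as efficient, {loss:.1%} loss")
```

The ratio is only meaningful when both strengths come from runs with the same number of respondents, tasks, concepts per task, and None %.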

Efficiency for Specific Parameters (Attribute Levels)

Sometimes, you may be more concerned about the efficiency of the design for estimating a specific parameter (such as the utility for your client's brand) rather than an overall efficiency of the design across all parameters. Let's assume that your client asked you to implement prohibitions between the client's brand name and other levels. Further assume that the overall relative strength of the design with prohibitions relative to the design without prohibitions is 97%. On the surface, this seems like little overall loss in efficiency. However, you note that the standard error (from the logit report using simulated data) for your client's brand was 0.026 prior to implementing the prohibition, but 0.036 afterward. The relative efficiency of the design with prohibitions relative to the non-prohibited design with respect to this particular attribute level is:

   $a^2 / b^2$

Where b is the standard error of the estimate for the client's brand name after the prohibition and a is the standard error prior to the prohibition. In this case, the relative design efficiency of the prohibited compared to the non-prohibited design with respect to this particular level is:

   $0.026^2 / 0.036^2 = 0.52$

And the impact of these prohibitions on estimating your client's brand utility is more fully appreciated.
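One way to read a level-specific efficiency is as a sample-size penalty. The sketch below repeats the calculation and, assuming standard errors shrink roughly in proportion to the square root of the number of observations (the usual behavior under aggregate estimation), reports how much larger the sample would need to be to regain the original precision. The function name is illustrative.

```python
def level_efficiency(se_before, se_after):
    """Relative efficiency for one parameter: squared ratio of standard errors."""
    return (se_before / se_after) ** 2


efficiency = level_efficiency(0.026, 0.036)
# If standard errors scale with 1 / sqrt(n), restoring the original precision
# takes about 1 / efficiency times as many observations.
sample_multiplier = 1.0 / efficiency
print(f"efficiency = {efficiency:.2f}, sample multiplier = {sample_multiplier:.2f}")
```

Under that assumption, the prohibited design would need nearly twice the sample to estimate the client's brand as precisely as the non-prohibited design.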

Additional Note: The pattern of random answers for random respondent data will have a small, yet perceptible, effect on the reported Strength of Design (relative D-Efficiency). For the purpose of estimating the absolute standard errors of the parameters, we suggest using the same number of dummy respondents as you plan to achieve with your study. But, for comparing the merits of one design to another, you can reduce the effect of the random number seed by increasing the dummy respondent sample size significantly, such as to 5000 or 10000 respondents. Increasing the sample size will greatly reduce the variation of the Strength of Design measure due to the random seed used for respondent answers.