RLH and Pct-Certainty

Dear Sawtooth Team,

I have created a CBC that uses only pictures, with no verbal descriptions. My master's thesis applies CBC in an unusual field, namely student schedules. Because there are many dependencies that would lead to many prohibitions, I allow the attribute levels (which are defined identically for all days) to be violated on some days to a small extent, which might hurt the goodness of fit. I have 15 tasks (10 random, 2 reliability holdouts, 3 validity holdouts), and each task compares two concepts (no None option). I have 5 attributes with 2 levels each. When I run HB, I get the following average goodness-of-fit values:

Pct-Cert: 0.487
RLH: 0.701
(153 participants)

Question 1: With two concepts, the chance-level RLH is 0.5, so 0.7 seems acceptable. In one study, Orme showed that Pct-Certainty should be at least 0.6, so should I assume there are problems with my study design?
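(For reference, here is a minimal sketch of how the two fit statistics are computed from one respondent's task-level choice probabilities. These are the standard textbook definitions, not Sawtooth's internal code, and the example probabilities are made up.)

    import numpy as np

    def rlh(chosen_probs):
        """Root likelihood: geometric mean of the probabilities the model
        assigned to the alternatives the respondent actually chose."""
        chosen_probs = np.asarray(chosen_probs, dtype=float)
        return float(np.exp(np.mean(np.log(chosen_probs))))

    def pct_certainty(chosen_probs, n_concepts=2):
        """Percent Certainty = 1 - LL / LL0, where LL0 is the log-likelihood
        of the chance model (probability 1/n_concepts for every task)."""
        chosen_probs = np.asarray(chosen_probs, dtype=float)
        ll = np.sum(np.log(chosen_probs))
        ll0 = len(chosen_probs) * np.log(1.0 / n_concepts)
        return float(1.0 - ll / ll0)

    # Hypothetical probabilities for one respondent's 10 random tasks
    # (pairs, so the chance level is 0.5 per task):
    p = [0.9, 0.8, 0.55, 0.7, 0.95, 0.6, 0.85, 0.5, 0.75, 0.65]
    print(round(rlh(p), 3), round(pct_certainty(p), 3))

With pairs, the chance probability is 0.5 per task, so a purely random responder lands near RLH = 0.5 and Percent Certainty = 0; that is why the two statistics move together but sit on different scales.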

In response, I removed unreliable participants using the reliability holdout tasks, and the values improved as follows:

Pct-Cert: 0.528
RLH: 0.721
(124 participants)

Since this is still not satisfactory, I also eliminated participants who took less than 2 minutes (which is appropriate for my study, as only images are compared). The values improved as follows:

Pct-Cert: 0.552
RLH: 0.733
(111 participants)

I repeated the same with a 3-minute cutoff:

Pct-Cert.: 0.578
RLH: 0.747
(99 participants)

I also tested the orthogonality of my design with the preliminary counting test and the logit efficiency test and got optimal values. However, an aggregate logit test shows that two attributes have standard errors of about 0.6.

Question 2: Can I conclude from this that my images may not be representative of the levels (this is certainly the case to some degree because, as mentioned above, violations were allowed in some images), and therefore that my approach of allowing deviations is causing problems?

Question 3: What are optimal values for Avg. Variance and RMS? Are they as important as Pct-Certainty?

Question 4: As a next step, I will calculate the hit rate. Am I right to assume that, based on my findings, the hit rate might be low?
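(On Question 4, a rough sketch of a first-choice hit-rate calculation against the validity holdouts might look like the following; the array names and design coding are illustrative assumptions. With pairs, a purely random respondent would land near a 50% hit rate.)

    import numpy as np

    def hit_rate(utilities, holdout_tasks, choices):
        """Share of holdout choices correctly predicted by each respondent's
        own part-worths (first-choice rule).

        utilities:     (n_respondents, n_parameters) part-worth matrix
        holdout_tasks: (n_tasks, n_concepts, n_parameters) design-coded concepts
        choices:       (n_respondents, n_tasks) index of the chosen concept
        """
        hits = 0
        total = 0
        for r, beta in enumerate(utilities):
            for t, task in enumerate(holdout_tasks):
                predicted = int(np.argmax(task @ beta))  # concept with the highest total utility
                hits += int(predicted == choices[r, t])
                total += 1
        return hits / total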

Thank you very much!
asked Sep 11, 2019 by Kristin

1 Answer

0 votes
The norms I reported for Percent Certainty were based on 25 CBC studies I had on hand.  But none of them used just pairs (2 concepts) per task!  So, you shouldn't take what I said about these other CBC studies to the bank regarding expected norms for Percent Certainty when using CBC with pairs.

When cleaning noisy respondents, I like to use a combination of speeding and RLH fit.  When people tend to speed and tend to have relatively low fit (compared to the rest of the respondents), then I tend to clean them.  It's not unusual in today's data collection world for a CBC researcher to want to clean 15% to 25% of the sample.
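(A minimal sketch of that combined rule, assuming you have exported a respondent-level file with an RLH column and an interview-time column; the column names, example values, and cutoffs below are placeholders, not anything built into Lighthouse Studio.)

    import pandas as pd

    # Hypothetical respondent-level summary exported after an HB run;
    # the column names and values are made up for illustration.
    resp = pd.DataFrame({
        "id":      [1, 2, 3, 4],
        "rlh":     [0.74, 0.52, 0.81, 0.55],
        "seconds": [310, 95, 240, 100],
    })

    rlh_cutoff  = resp["rlh"].quantile(0.10)   # e.g. bottom 10% of fit
    time_cutoff = 120                          # e.g. 2 minutes

    # Flag only respondents who BOTH sped AND show relatively low fit
    flagged = resp[(resp["seconds"] < time_cutoff) & (resp["rlh"] < rlh_cutoff)]
    cleaned = resp.drop(flagged.index)
    print(len(flagged), "flagged of", len(resp))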

If you are running aggregate logit estimation for your CBC study and getting standard errors for main effects of 0.6, then this is not good precision.  We want standard errors from aggregate logit to be around 0.05 or less.  Did you use some prohibitions that harmed your design efficiency?
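(As a rough illustration of where such simulated standard errors come from: with two concepts per task, aggregate logit reduces to a logistic regression on the attribute-level differences between the two concepts. The toy simulation below uses a random, unoptimized design and made-up utilities, so it will not reproduce the numbers from a real Lighthouse Studio test design; it only demonstrates the idea.)

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n_resp, n_tasks, n_attrs = 153, 10, 5
    true_beta = np.array([0.5, -0.4, 0.3, 0.6, -0.2])   # made-up utilities

    # Code each pair task as the difference between the two concepts'
    # 0/1 attribute levels, then simulate choices from the true utilities.
    x = (rng.integers(0, 2, (n_resp * n_tasks, n_attrs))
         - rng.integers(0, 2, (n_resp * n_tasks, n_attrs)))
    p_choose_a = 1 / (1 + np.exp(-(x @ true_beta)))
    y = rng.binomial(1, p_choose_a)

    fit = sm.Logit(y, x).fit(disp=0)
    print(fit.bse.round(3))    # simulated standard errors per attribute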

There aren't optimal values for variance or RMS parameter size.  It just depends so much on the heterogeneity that exists in the sample.  Heterogeneity by itself isn't a good or bad thing: it's just a characteristic of the population.
answered Sep 11, 2019 by Bryan Orme Platinum Sawtooth Software, Inc. (172,790 points)
Thanks Bryan! Did I get it right that for two concepts you cannot state a minimum value that should be achieved? Is the RLH then reasonably satisfying?

I need to clean 29 of 153 respondents who are not reliable, which is already 19%.
Step 1: To clean respondents with a low RLH, I first need to run HB with the reliable data, right? Then I export the utility file and look at the individual RLHs of the reliable respondents.
Step 2: Based on this, I clean all respondents who have an RLH < 0.579 (https://www.sawtoothsoftware.com/help/lighthouse-studio/manual/hid_web_maxdiff_badrespondents.html) and a total time < 2 min.
Step 3: Then I run HB again with the reliable respondents who have a good RLH and took enough time.
Step 4: Based on the new RLHs from Step 3, I can calculate the average RLH and average Pct-Certainty as you described in another forum post. Or do I need to use the "old" RLHs from Step 1 for that?
After conducting Steps 1-2, I have eliminated 29 unreliable respondents and 7 additional respondents with RLH < 0.579 and time < 2 min. So in total I excluded 23.5%.
Is this approach ok?

I have two prohibitions, so the number of possible profiles decreased from 32 to 30. The preliminary counting analysis showed excellent results for frequencies and simulated standard errors (I compared it to the same design with no prohibitions and the outputs are the same). However, I used pictures instead of verbal descriptions and allowed some "deviations" of attribute levels in order to avoid more prohibitions. Normally I would have needed 10 prohibitions, but because deviations of attribute levels are permitted to a certain degree, I could reduce the number to two. These deviations might have caused the higher standard errors, because the pictures are not 100% representative of the attribute levels. Is this right?

Thank you so much!
That paper you are referencing is about MaxDiff, not CBC.  You cannot apply the guidelines we published regarding RLH for identifying "bad respondents" in MaxDiff to your CBC study.  MaxDiff is a different animal.

Again, we haven't published any norms regarding expected RLH or Percent Certainty for CBC studies that use just 2 concepts per task.

To clean respondents with relatively low RLH, you run HB.  Then, after cleaning, you run HB again only involving the remaining respondents who were not cleaned.  To get a feel for what RLH you might expect from HB analysis on random respondents, you could create a copy of your project (using File + Save As) and generate a few hundred respondents using the data generator (random responding people).  Then, run HB estimation on that different project that has random respondents.  Then, look at the RLH for random responders.
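(If you want a quick feel for that benchmark outside Lighthouse Studio, a rough standalone approximation is to simulate random responders on 10 pair tasks with 5 binary attributes and fit a simple regularized logit per respondent. HB shrinks estimates differently, so treat the result only as a rough floor for how much individual-level fitting inflates RLH above the 0.5 chance level on pure noise.)

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_resp, n_tasks, n_attrs = 300, 10, 5

    rlh_values = []
    for _ in range(n_resp):
        # For pairs, code each task as the difference between concept A and
        # concept B on the 0/1 attribute levels, with purely random choices.
        x = (rng.integers(0, 2, (n_tasks, n_attrs))
             - rng.integers(0, 2, (n_tasks, n_attrs)))
        y = rng.integers(0, 2, n_tasks)
        if y.min() == y.max():
            continue                    # a logistic fit needs both outcomes present
        fit = LogisticRegression(fit_intercept=False).fit(x, y)
        p_chosen = fit.predict_proba(x)[np.arange(n_tasks), y]   # in-sample probabilities
        rlh_values.append(np.exp(np.mean(np.log(p_chosen))))

    print("mean in-sample RLH for random responders:", round(float(np.mean(rlh_values)), 3))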

If you ran the test design with your two prohibitions and the expected sample size (similar to your final sample size) and got standard errors around 0.05 or less, then it really surprises me that, with that approximate sample size and real data, you would see standard errors for aggregate logit of 0.6.  Something is wrong there.  It seems nearly impossible to me, because the standard errors depend much more on the experimental design than on the respondent answers.

Whether you used pictures to represent the levels or text should not have affected the standard errors of the aggregate logit estimates much at all.

I'm sorry, but I don't understand what you mean by your text:  "and allowed for some "deviations" of attribute levels, to avoid more prohibitions. Normally I would have needed 10 prohibitions but due to the permitted deviations of attribute levels to a certain degree, I could decrease the number to two prohibitions."
Thank you! To summarize: the 10% of participants with the lowest RLH (RLH < 0.5752) who additionally needed a total time of less than 2 min were eliminated (seven respondents). By excluding the unreliable participants and these seven participants, I achieved a Pct-Certainty of 0.568 and an RLH of 0.741. All the standard errors are slightly below 0.5; I might have made a mistake earlier.

Now I have a trade-off: for all completed participants, including the unreliable ones, I have good standard errors (around 0.038) but weaker model fit (Pct-Certainty 0.487, RLH 0.701). Including only the reliable participants, I have standard errors of around 0.4 and better model fit (Pct-Certainty 0.528, RLH 0.721). Eliminating the unreliable respondents plus the additional seven respondents, I have standard errors slightly below 0.5 but even better model fit (Pct-Certainty 0.568 and RLH 0.741). Do you have any advice on which data I should use for estimating part-worth values, importances, and so on?
I'm confused because at one point you say that you have standard errors from aggregate logit for all completes of 0.038.  But, by deleting a few respondents you now have standard errors of 0.4.  That's just not possible in my experience.  To halve standard errors takes a quadrupling of the sample size.  I'm wondering if you meant to say 0.04 rather than 0.4?
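(As a back-of-the-envelope check of that square-root rule using the numbers from this thread: aggregate logit standard errors shrink roughly with the square root of sample size, so starting from about 0.038 at 153 respondents, halving them would take roughly four times the sample, while trimming to 117 respondents only nudges them up. Real designs will deviate somewhat from this approximation.)

    import math

    se_153 = 0.038                       # observed with all 153 completes
    for n in (612, 153, 124, 117):       # 612 = roughly 4x the sample
        print(n, round(se_153 * math.sqrt(153 / n), 3))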
I'm sorry, Bryan, I should have re-read what I typed more carefully. I meant 0.04. Again:

Dataset 1: 153 respondents (complete);  Pct-Certainty 0.487, RLH 0.701; Standard errors of the five attributes: 0.039, 0.038, 0.039, 0.037, 0.042

Dataset 2: 124 respondents (complete & reliable); Pct-Certainty 0.528, RLH 0.721; Standard errors of the five attributes: 0.043, 0.042, 0.044, 0.041, 0.047

Dataset 3: 117 respondents (complete & reliable reduced by seven earlier mentioned respondents); Pct-Certainty 0.568 and RLH 0.741; Standard errors of the five attributes: 0.046, 0.043, 0.046, 0.043, 0.049

Since the goal of HB is not to maximize RLH or Pct-Certainty, and the standard errors are not reduced by eliminating further respondents, I would say that it makes no sense to reduce the number of respondents to 117. Therefore, I planned to base my analysis on Dataset 2. What would you suggest?
OK, there's a big difference between 0.04 and 0.4 when we're discussing standard errors!  Thanks!

Keep in mind: standard errors for aggregate logit will always tend to DECREASE as you add more sample size, even if those additional respondents you're adding are garbage, random respondents.  So, the conclusion shouldn't be that it's better to have 153 respondents with lower aggregate logit standard errors per attribute than a cleaner (at the individual level) set of 117 respondents with higher aggregate logit standard errors.

Some cleaning, to delete speeders or respondents who are barely better than random responders, is usually helpful.  But, how deep to cut and whether you are throwing the baby out with the bathwater is not always clear.
Sorry, in the post above I accidentally said "increase" when I meant to say DECREASE in the second paragraph above.  I've corrected it in the text above and put it in caps.
Thank you so much, you helped me a lot!
...