You’ve often seen researchers report a finding as “statistically significant.” Even if the finding is very unlikely to occur by chance, is the difference in ratings or size of the regression effect big enough to matter a hill of beans to a decision-maker?
Statistical significance doesn’t imply practical or managerial significance; it just means that a statistical finding passes some arbitrary confidence threshold (such as 95% confidence, or p<0.05). Yet, as the sample size grows and precision improves, even very tiny differences become “statistically significant.” To illustrate, a mean rating of 4.30 for segment 1 versus 4.25 for segment 2 would likely not be “statistically significant” at a sample size of n=500 per segment. But collect a sample of 10,000 respondents in each segment and this tiny difference may become wildly “statistically significant.” Just because the difference in ratings achieves p<0.001 doesn’t mean the manager should pay any attention. Despite the impressive p-value, such a tiny difference in rating scores is not managerially significant enough to justify investing in a different strategy for the two segments.
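To make the sample-size effect concrete, here is a rough sketch in Python of the 4.30 versus 4.25 comparison. The standard deviation of 0.9 rating points is purely my assumption (no spread is given above), the function name is illustrative, and the p-value uses a normal approximation rather than an exact t distribution.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_p_value(mean1, mean2, sd, n_per_segment):
    """Two-sided p-value for a difference in means (normal approximation).

    Assumes both segments share the same standard deviation and sample size.
    """
    se = sd * sqrt(2.0 / n_per_segment)  # standard error of the difference
    z = (mean1 - mean2) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


ASSUMED_SD = 0.9  # hypothetical spread of ratings; not a figure from the article

for n in (500, 10_000):
    z, p = two_sample_p_value(4.30, 4.25, ASSUMED_SD, n)
    print(f"n={n:>6} per segment: z={z:.2f}, p={p:.5f}")
```

Under that assumed spread, the same 0.05-point gap is nowhere near significant at n=500 (z of about 0.88) yet clears p<0.001 at n=10,000 (z of about 3.93). The difference itself hasn't changed; only the precision has.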
I recently read an editorial article in The American Statistician entitled, “Moving to a World Beyond ‘p<0.05’” (Wasserstein et al. 2019). This article not only reinforced the managerial significance issue, but it also called out the silliness of feeling warm and fuzzy with p=0.049 while dismissing a result as not noteworthy if the test returned p=0.051. The authors argued that researchers should report p-values as continuous variables and not be concerned about attaching dichotomous labels such as significant/not significant based on some arbitrarily chosen cutoff. They also encouraged us to embrace uncertainty and to regularly report confidence intervals rather than just point estimates.
Rather than focus on reporting significant differences (greater than zero), Wasserstein et al. suggest we should think about how big a difference or effect would be meaningful enough to make some business decision, and test against that modified threshold rather than against a null zero effect. For example, if for a MaxDiff experiment we felt that the preference for a new test item was large enough to make a business decision if it had at least 10% higher choice likelihood than some reference item, then this helps us define a “minimally practical effect size.” Sawtooth Software reports MaxDiff scores both on a raw logit scale and a probability scale. Given our stated goal of 10% superior choice likelihood, we can conveniently use the probability-scaled scores to conduct this type of test. We can run a t-test involving the new test item versus the reference item and, rather than testing against the null hypothesis of zero difference, we’re now testing against the null hypothesis that the improvement is no more than 10%.
Because many reading this article will want to see the steps involved to conduct a statistical test for managerial practicality, I’ll give an example involving MaxDiff (though the principles also apply to rating scales, CBC shares of preference, or other measures).
Imagine we interview 500 respondents using MaxDiff involving 20 items. Each respondent sees each item at least 2x, or preferably 3x, so we’re comfortable that HB can provide reasonably precise estimates at the individual level. Let’s say we’ve decided that the preference for a new item is only managerially significant if it achieves at least 10% higher choice likelihood than some reference item already in market that was also included in the experiment.
Imagine the item already in market had a mean probability score from MaxDiff of 10 for the sample and the new item of interest had a mean score of 11.8. That’s 18% higher, but both scores are measured with error, so we’re not certain that the new item exceeds the existing item by at least 10%. We can conduct a repeated-measures t-test, because we have a score for each item for each respondent. Imagine the data were in a spreadsheet. We simply create a new column that is the probability score for the new item minus the score for the existing item. We copy that formula down the 500 respondent rows, finding of course that the mean difference for that new column is 1.8, since 11.8 - 10 = 1.8. Next, take the standard deviation across that new column (imagine the standard deviation is 12). We then compute the standard error of that difference by dividing 12 by the square root of the sample size, or 12 / SQRT(500) = 0.537.
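For readers who prefer code to a spreadsheet, the difference-column step might be sketched as follows. The five respondents are made-up numbers chosen only to show the mechanics, the function name is hypothetical, and I'm assuming the sample (n - 1) standard deviation, which is what a spreadsheet STDEV function returns.

```python
from math import sqrt
from statistics import mean, stdev


def paired_difference_summary(new_scores, existing_scores):
    """Return (mean difference, SD of differences, standard error)."""
    if len(new_scores) != len(existing_scores):
        raise ValueError("Each respondent needs a score for both items.")
    # The "new column": new item minus existing item, one value per respondent.
    diffs = [new - old for new, old in zip(new_scores, existing_scores)]
    sd = stdev(diffs)            # sample standard deviation (n - 1 denominator)
    se = sd / sqrt(len(diffs))   # standard error of the mean difference
    return mean(diffs), sd, se


# Tiny made-up data set (5 respondents), purely illustrative.
new_item = [14.0, 9.5, 20.0, 3.0, 12.5]
existing_item = [10.0, 11.0, 12.0, 6.0, 11.0]

mean_diff, sd_diff, se_diff = paired_difference_summary(new_item, existing_item)
print(f"mean diff={mean_diff:.2f}, SD={sd_diff:.2f}, SE={se_diff:.2f}")

# Standard error from the summary figures in the text (SD = 12, n = 500).
print(f"SE from the article's numbers: {12 / sqrt(500):.3f}")
```

With real data you would pass in the two full columns of 500 probability-scaled scores instead of the toy lists.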
Normally, the t-test focuses on whether the difference between items was greater than zero by taking the difference divided by the standard error: t = (11.8 - 10) / 0.537; t = 3.35, which corresponds to 99.9% confidence. But, we’ve established a threshold of practical significance that the new item must have at least 10% higher choice likelihood than the item that’s currently in market. The existing item has a score of 10, thus the new item needs to have a score of at least 11 to be considered managerially significant. So, the null hypothesis threshold to test against is 11 rather than 10. The t-test for managerial significance is therefore (11.8 - 11) / 0.537 = 1.49, which corresponds to 86% confidence.
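The two tests above can be reproduced from the summary figures alone. This is a minimal sketch, not a definitive implementation: the function name is mine, and it uses a normal approximation to the t distribution, which I'm assuming is close enough at n=500 (it would not be for small samples). The confidence is computed as one minus the two-sided p-value, since that is the convention that matches the percentages quoted in the text.

```python
from math import sqrt
from statistics import NormalDist


def confidence_against_threshold(mean_new, threshold, sd_diff, n):
    """t statistic and two-sided confidence for mean_new versus a threshold.

    Normal approximation to the t distribution (an assumption; fine for
    large n such as 500, not for small samples).
    """
    se = sd_diff / sqrt(n)
    t = (mean_new - threshold) / se
    confidence = 2.0 * NormalDist().cdf(abs(t)) - 1.0  # 1 - two-sided p
    return t, confidence


existing_mean, new_mean = 10.0, 11.8
sd_diff, n = 12.0, 500
practical_threshold = existing_mean * 1.10  # "at least 10% higher" => 11.0

t_zero, conf_zero = confidence_against_threshold(new_mean, existing_mean, sd_diff, n)
t_prac, conf_prac = confidence_against_threshold(new_mean, practical_threshold, sd_diff, n)

print(f"vs. zero difference:   t={t_zero:.2f}, confidence={conf_zero:.1%}")
print(f"vs. 10% improvement:   t={t_prac:.2f}, confidence={conf_prac:.1%}")
```

If you prefer a one-sided reading of "at least 10% better," the confidence for the practical threshold would be `NormalDist().cdf(t_prac)`, or roughly 93%, rather than 86%.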
So, we’re 86% confident that the new item is at least 10% better than the existing item and we’re 99.9% confident that the new item is better than the existing item. We may consider those probabilities strong enough to proceed with a business decision involving the new item. Note that even though the test for managerial significance failed the traditional 95% confidence test, a decision-maker may still consider the evidence strong enough to proceed with the new item. There’s nothing magical about having 95% confidence, and 86% confidence is certainly better than a coin toss.
My colleague, Keith Chrzan, recently wrote a nice article on computing sample sizes needed to detect specific effect sizes for MaxDiff. So, if you’re in the planning phase for a MaxDiff experiment and need the sample size to give you enough power to detect managerially significant differences, he addresses the issue and I recommend you take a look. It may be viewed at: www.sawtoothsoftware.com/2199.
(We acknowledge that rather than using frequentist t-tests, we could use Bayesian tests involving examining the distribution of alpha draws, by transforming the draws to probability-scaled scores.)
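As a rough illustration of the Bayesian alternative, the sketch below simply counts the share of draws in which the new item beats the existing item by the required margin. It assumes the alpha draws have already been transformed to probability-scaled scores (the transformation itself is not shown), and the eight pairs of draws are invented for illustration; a real analysis would use thousands of draws. The function name is hypothetical.

```python
def share_of_draws_meeting_lift(draws_new, draws_existing, lift=0.10):
    """Share of paired draws where new >= existing * (1 + lift).

    Each pair is assumed to come from the same draw, already converted
    to probability-scaled scores.
    """
    if len(draws_new) != len(draws_existing):
        raise ValueError("Draws must be paired one-to-one.")
    hits = sum(
        1 for new, old in zip(draws_new, draws_existing)
        if new >= old * (1.0 + lift)
    )
    return hits / len(draws_new)


# Invented draws (8 pairs) purely to show the counting logic.
draws_new = [11.2, 12.4, 11.9, 10.8, 12.1, 11.5, 12.6, 11.0]
draws_existing = [10.1, 9.8, 10.3, 10.0, 9.9, 10.4, 10.2, 9.7]

print(share_of_draws_meeting_lift(draws_new, draws_existing, lift=0.10))
print(share_of_draws_meeting_lift(draws_new, draws_existing, lift=0.0))
```

The resulting share can be read directly as the probability, given the model and data, that the new item clears the managerial threshold.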
Chrzan, Keith (2019), “Sample Size and Power Analysis for MaxDiff, Parts 1 and 2”, accessed 10/10/2019 from www.sawtoothsoftware.com/2199
Wasserstein, Ronald L., Allen L. Schirm, and Nicole A. Lazar (2019), “Moving to a World Beyond ‘p<0.05’”, The American Statistician, vol. 73, sup. 1, 1-19.