Measuring cross validity & over-fitting in Latent Class


After running the Latent Class analysis, the output does not point to a clear number of groups (segments) that best fits my data.
The CAIC and BIC keep decreasing as the number of groups increases, although the relative Chi Square is highest for the two-group solution.

What do I have to do?

I heard it is possible to do a cross-validation and/or test for overfitting. How does this work in Sawtooth, and how should I interpret the output?

I hope you can help me out.

Thank you in advance!
asked Feb 16, 2017 by anonymous

1 Answer

Lower CAIC and BIC values are actually better.  
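
For reference, both criteria are penalized log-likelihoods, which is why lower is better. Here is a minimal sketch in Python, assuming the textbook definitions BIC = -2LL + p ln(N) and CAIC = -2LL + p (ln(N) + 1). The log-likelihoods, parameter counts, and N below are made-up illustrative numbers, not output from the software, and what counts as N (respondents or choice tasks) is an assumption you should check against the Latent Class documentation.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    # Bayesian Information Criterion: fit term plus a penalty per parameter.
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def caic(log_likelihood, n_params, n_obs):
    # Consistent AIC: same fit term, slightly heavier penalty per parameter.
    return -2.0 * log_likelihood + n_params * (math.log(n_obs) + 1.0)

# Hypothetical fits: number of groups -> (log-likelihood, parameters estimated).
fits = {2: (-5200.0, 21), 3: (-5050.0, 32), 4: (-4990.0, 43)}
n_obs = 3600  # hypothetical observation count

# Compute both criteria for every solution.
table = {k: (bic(ll, p, n_obs), caic(ll, p, n_obs)) for k, (ll, p) in fits.items()}

# The solution with the lowest CAIC is the one the criterion prefers.
best_by_caic = min(table, key=lambda k: table[k][1])

for k, (b, c) in table.items():
    print(f"{k} groups: BIC={b:.1f}  CAIC={c:.1f}")
```

Because the penalty grows with the number of parameters, a criterion that keeps falling as groups are added means the gain in fit is still outweighing the penalty at each step.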

You should ask yourself why you are running latent class.  What's the goal?  Interpretability of the groups?  Predictive validity based on the choice simulator you could build using latent class utilities?

Most of our users tend to develop latent class solutions mainly for the strategic purposes of finding a finite number of segments to explain the market (in a simple enough way that humans can easily grasp) and maintain for target marketing and managerial purposes.  

Most of our users employ HB estimation to estimate the individual-level scores that are used in the market simulator to make predictions about product choice.

It's really nice to use HB to predict individuals' choices in the simulator, and then to break out those results by latent class segment membership as separate columns (as "banner points").
answered Feb 16, 2017 by Bryan Orme Platinum Sawtooth Software, Inc. (133,765 points)
Thank you for your answer.

Though, I still have some questions.

I do indeed want to develop classes for strategic purposes. But Latent Class now shows me that 7 groups have the lowest CAIC and BIC, and 7 segments are more than I would like to interpret. I don't know what to do with this.

What is possible within Sawtooth regarding cross-validation and overfitting? I think the algorithm for my data is too complicated.

Thank you in advance.
With strategic segmentation via cluster or latent class, you want to find a balance between:

1.  Interpretability (the solutions make sense from a story-telling standpoint).
2.  Large enough segments (the target segments you are interested in are large enough such that it is meaningful to separate them and you have enough people in each target segment to tabulate their responses to the other questions in your survey so you can understand their motivations and characteristics).
3.  Stability (if you repeat the latent class segmentation from different random starting points, you'll get essentially the same result every time...meaning it's not a lucky find).
4.  Targetability (the segments you are interested in are targetable in ways meaningful to your business...you can create a product that reaches them, you can identify who they are to advertise/promote to them).
5.  Statistically meaningful fit (the BIC or CAIC criteria are low).

Obviously, not every segmentation solution (2-group, 3-group, etc.) is equally good in all these aspects.  There is both art and science to strategic market segmentation, and you often have to trade off aspects of one goal to reach another.
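
The stability and segment-size criteria above can be checked outside the software once you have each respondent's segment membership from two runs. This is a rough sketch in Python, not a Sawtooth feature: it assumes you have exported the memberships as integer labels 0..k-1 (the lists below are hypothetical), and `best_agreement` and `segment_shares` are illustrative helper names. Segment numbering is arbitrary from run to run, so the comparison tries every relabeling of the second run and keeps the best match.

```python
from collections import Counter
from itertools import permutations

def best_agreement(run_a, run_b, k):
    """Share of respondents placed in the same segment in both runs,
    after relabeling run_b's segments to best match run_a's."""
    best = 0.0
    for perm in permutations(range(k)):
        # perm[b] is the run_a label that run_b's segment b is mapped to.
        matches = sum(1 for a, b in zip(run_a, run_b) if a == perm[b])
        best = max(best, matches / len(run_a))
    return best

def segment_shares(run):
    """Proportion of respondents in each segment."""
    n = len(run)
    return {seg: count / n for seg, count in sorted(Counter(run).items())}

# Hypothetical memberships for 10 respondents from two random starting points.
run_a = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
run_b = [2, 2, 0, 0, 1, 1, 2, 0, 1, 1]

stability = best_agreement(run_a, run_b, k=3)
shares = segment_shares(run_a)

print(f"agreement between runs: {stability:.0%}")
print("segment shares:", shares)
```

An agreement close to 100% suggests the solution is not a lucky find, and the shares show whether any segment is too small to tabulate. Trying every relabeling is only practical for a handful of segments (7 groups means 5,040 permutations, which is still quick).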