k-Means vs. Consensus Clustering in CCEA

Hello Sawtooth Community,

I have a question regarding the CCEA tool:

If I want to extract a cluster solution for my dataset, e.g., distance-based k-means with 4 groups, I have two possibilities:

1.) Select k-Means clustering in CCEA and extract the solution

2.) Select consensus clustering in CCEA and extract the distance-based k-means with 4 groups from the resulting ensemble-file

However, when comparing both solutions, it becomes apparent that there are differences between them. Why is that and which strategy would you recommend?

Thank you very much for your answer and help
asked Feb 26, 2015 by anonymous

1 Answer

There are hundreds of different clustering algorithms, so when you use different algorithms, you often obtain different results.

From a statistical standpoint, we have found that the cluster ensemble approach is superior to traditional K-means for recovering "true" structure from datasets built with known true group membership.  Our CCEA Technical paper describes those experiments.

But cluster segmentation analysis is as much art as science. You have to find a segmentation scheme that is useful from a managerial perspective. It is quite possible that a clustering method that is inferior from a statistical standpoint may, for a particular data set, give a result that seems more managerially useful to the human eye.

There are also different pre-processing steps that affect the final solution: the choice of input variables, and whether to standardize or center them. So, if you don't like a particular solution, you can consider whether a different pre-processing step is justified, or whether there is a better way to prepare the data.
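To illustrate why the pre-processing choice matters, here is a minimal Python sketch (not CCEA code; the data and function names are made up for illustration). It shows how standardizing changes the distances that a distance-based method such as k-means would see:

```python
from statistics import mean, pstdev

def center(column):
    """Subtract the column mean so the variable has mean 0."""
    m = mean(column)
    return [x - m for x in column]

def standardize(column):
    """Center, then divide by the population standard deviation."""
    m, s = mean(column), pstdev(column)
    if s == 0:
        raise ValueError("cannot standardize a constant column")
    return [(x - m) / s for x in column]

def euclidean(a, b):
    """Straight-line distance between two respondents."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical data: income in dollars and a 1-5 rating for three respondents.
income = [30000.0, 32000.0, 90000.0]
rating = [1.0, 5.0, 1.0]

raw = list(zip(income, rating))
scaled = list(zip(standardize(income), standardize(rating)))

# Distance between respondents 0 and 1, relative to respondents 0 and 2.
raw_ratio = euclidean(raw[0], raw[1]) / euclidean(raw[0], raw[2])
scaled_ratio = euclidean(scaled[0], scaled[1]) / euclidean(scaled[0], scaled[2])
```

On the raw data, income dominates the distances, so respondents 0 and 1 look almost identical compared with respondent 2 (`raw_ratio` is small). After standardizing, their opposite ratings count as well, and the two distances become similar in size. Centering alone (`center`) would leave the raw distances unchanged, since it only shifts each variable.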
answered Feb 26, 2015 by Bryan Orme Platinum Sawtooth Software, Inc. (131,390 points)
Thank you very much for your quick response.

However, there is still an open question:

So far, I told my audience that I built the ensemble based on the most reproducible k-means solutions (as obtained by selecting "k-means, most reproducible solution" from CCEA).

Today, however, I found that the 3-group distance-based solution in the ensemble file (as created by CCEA and consensus option) is not similar to the one that is the most reproducible among the 30 replications (as obtained from the "k-means" option).

Therefore, I am not sure how to explain to my audience what went into the ensemble file, since it seems that (although a fixed seed is used) the ensemble created by CCEA does not comprise the most reproducible k-means solutions.

Thanks again for your input and answer.
The ensemble file that CCEA creates (which it uses to compute a consensus solution) contains many candidate cluster solutions created using a variety of methodologies (described in the manual). Ensembles of candidate solutions work really well if the segmentation solutions in the ensemble were computed using many diverse methods, such as k-means from different starting points, different types of hierarchical clustering routines, etc. That is to say, diversity among the candidate segmentation solutions in the ensemble contributes to the strength of the final consensus solution.
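As a rough illustration of how an ensemble pools information across candidate solutions, the Python sketch below builds a co-association matrix: for each pair of respondents, the share of candidate solutions that put them in the same group. This is a generic textbook device and is not claimed to be how CCEA computes its consensus (see the manual for that); the function name and data are hypothetical.

```python
def co_association(ensemble):
    """ensemble: list of candidate solutions, each a list of cluster labels
    (one label per respondent). Returns an n x n matrix whose [i][j] entry is
    the share of solutions that place respondents i and j together."""
    n = len(ensemble[0])
    if any(len(sol) != n for sol in ensemble):
        raise ValueError("all solutions must cover the same respondents")
    counts = [[0] * n for _ in range(n)]
    for sol in ensemble:
        for i in range(n):
            for j in range(n):
                if sol[i] == sol[j]:
                    counts[i][j] += 1
    return [[c / len(ensemble) for c in row] for row in counts]

# Hypothetical ensemble: three candidate solutions for five respondents.
ensemble = [
    [1, 1, 2, 2, 2],
    [2, 2, 1, 1, 1],  # same partition as the first, labels swapped
    [1, 1, 1, 2, 2],  # disagrees about the third respondent
]
agreement = co_association(ensemble)
```

Here the first two respondents are grouped together in every candidate, while the third respondent sits with the first in only one of the three. Each candidate on its own can be imperfect, but the pooled agreement shows which groupings are stable across methods.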

This may seem strange to say, but the single most reproducible K-means solution (obtained by comparing 30 solutions from different starting points, as our traditional K-means algorithm does within the CCEA software) is not necessarily a good contributor within the cluster ensemble. Most people very familiar with ensembles would say that including all 30 K-means solutions in the ensemble, rather than just the one best replicate of those 30, would probably be more beneficial. There is value in having diverse candidates in the ensemble, even though each one may be in some ways suboptimal.
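For readers who want to see what "most reproducible among the replicates" can mean in practice, here is a small Python sketch. It assumes reproducibility is scored as the average pairwise agreement (Rand index) between one replicate and all the others; that scoring rule is an assumption for illustration, and CCEA's own measure is the one described in its documentation. The replicate data are made up.

```python
from itertools import combinations

def rand_index(a, b):
    """Share of respondent pairs on which two solutions agree (together in
    both, or apart in both). Unaffected by how clusters are numbered."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def most_reproducible(replicates):
    """Index of the replicate with the highest mean agreement with the others."""
    def mean_agreement(k):
        others = [r for m, r in enumerate(replicates) if m != k]
        return sum(rand_index(replicates[k], r) for r in others) / len(others)
    return max(range(len(replicates)), key=mean_agreement)

# Hypothetical replicates (different starting points) for six respondents.
replicates = [
    [1, 1, 1, 2, 2, 2],
    [2, 2, 2, 1, 1, 1],  # same partition as the first, labels swapped
    [1, 1, 2, 2, 2, 2],
    [1, 2, 1, 2, 1, 2],
]
best = most_reproducible(replicates)
```

Note that the first two replicates are the same partition with the group numbers swapped, so a comparison must look at who is grouped with whom rather than at the raw labels. That is also worth keeping in mind when comparing a k-means solution against a column of the ensemble file by eye.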

If you want the most reproducible K-means solution from your K-means run to be included in the ensemble (it shouldn't hurt), you can modify the ensemble file and add it as a new column. Then point the CCEA software to your modified ensemble file by clicking "Use custom ensemble file".
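If you prefer to add the column with a script rather than by hand, something along these lines could work. This Python sketch assumes the ensemble file is a plain CSV with one respondent ID column and one column per candidate solution; that layout, the column names, and the `add_solution` helper are all assumptions for illustration, so check your actual file before adapting it.

```python
import csv
import io

def add_solution(ensemble_csv, id_column, new_name, assignments):
    """Return CSV text with one extra column of cluster assignments.
    assignments maps respondent ID (as text) to a cluster label."""
    reader = csv.DictReader(io.StringIO(ensemble_csv))
    rows = list(reader)
    missing = [r[id_column] for r in rows if r[id_column] not in assignments]
    if missing:
        raise ValueError(f"no assignment for respondents: {missing}")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + [new_name],
                            lineterminator="\n")
    writer.writeheader()
    for r in rows:
        r[new_name] = assignments[r[id_column]]
        writer.writerow(r)
    return out.getvalue()

# Hypothetical ensemble file contents; the real CCEA layout may differ.
original = "id,sol1,sol2\n101,1,2\n102,2,2\n103,1,1\n"
kmeans_best = {"101": 1, "102": 2, "103": 2}
modified = add_solution(original, "id", "kmeans_best", kmeans_best)
```

The helper matches on respondent ID instead of row order and refuses to write a file if any respondent lacks an assignment, which guards against silently misaligned columns.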