Pride and Prejudice and Cluster Analysis

It isn't often that one can combine a charming 19th century comedy of manners with a hard-edged 21st century discussion of segmentation methodology. As it turns out, this isn't one of those times.

Back in the late 1980s, Convergent Cluster Analysis (CCA) became the first Sawtooth Software product I used. I had my own solutions for conjoint analysis and perceptual mapping, but I was dissatisfied with the cluster analysis packages I had, including those in SAS and SPSS. All the K-means programs suffered from the same problem: different starting seeds could produce different answers, so unless you ran lots of solutions you couldn't tell if you were picking up robust structure in the data or just an outlier solution. While some folks might like the freedom of getting a variety of answers, it gives analysts a lot of wiggle room to "make stuff up." As the kind of marketing scientist who valued finding reliable, repeatable marketing segments instead of ephemeral marketing figments, I wanted to avoid making stuff up at all costs; to me it was a dereliction of duty. Of course, running lots of K-means solutions with different starting points, crosstabbing them and looking for consistency made for a lot of work and several boring manual steps.

When I first saw CCA it was love at first sight. Now I had a cluster analysis package that automated K-means cluster analysis with 10 sets of starting seeds, selected the most robust solution for a given number of clusters and reported reproducibility statistics. As time went on and my old 486 machine became a Pentium machine I was able to convince Sawtooth Software to make me a custom version of this K-means done right that didn't limit me to just 10 replications. While other folks were using just 10 replications I could run up to 99.
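For readers who like to see an idea in code, the replicate-and-compare routine can be sketched in a few lines of Python. This is only an illustration of the general approach, not CCA's actual algorithm: the function names, the use of the Rand index as the agreement measure, and the rule of keeping the start that agrees best with all the others are assumptions made for the example.

```python
import random
from itertools import combinations

def kmeans(points, k, seed, iters=50):
    """Plain Lloyd's K-means from one random start; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # starting seeds drawn from the data
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its old center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def rand_index(a, b):
    """Share of point pairs on which two partitions agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def most_reproducible(points, k, n_starts=10):
    """Run K-means from several seeds; keep the solution that best matches the rest."""
    solutions = [kmeans(points, k, seed) for seed in range(n_starts)]
    scores = [
        sum(rand_index(solutions[i], solutions[j]) for j in range(n_starts) if j != i)
        / (n_starts - 1)
        for i in range(n_starts)
    ]
    best = max(range(n_starts), key=scores.__getitem__)
    return solutions[best], scores[best]

# Toy data (hypothetical): three loose blobs of ten respondents each.
rng = random.Random(0)
blobs = [(0, 0), (5, 5), (0, 5)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blobs for _ in range(10)]
labels, score = most_reproducible(points, k=3, n_starts=10)
print(f"mean agreement with the other starts: {score:.2f}")
```

A mean agreement near 1 suggests the starts keep landing on the same partition; a low value is the warning sign that a single run might have been an outlier solution.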

At this point I was ready to be lazy: the world of conjoint analysis continued to evolve, as choice-based conjoint replaced ratings-based conjoint, as experimental design theory advanced and as HB analysis joined aggregate MNL modeling in the toolkit. It was enough to make your brain tired. I was pretty sure CCA and its lightning fast use of replications to find robust solutions was as good as I was going to get in the world of cluster analysis. I was happy to have one area where I could relax, confident in the knowledge that the methods we currently used worked well.

But it wasn't to be: one day Ming Shan and Joe Retzer, two bright marketing scientists on my staff at Maritz Research, brought a new method called cluster ensembles to one of our internal training meetings. They showed that it could out-cluster CCA and they laid waste to my comfortable belief that CCA was an endpoint in cluster analysis evolution. Joe and Ming introduced cluster ensembles analysis to the marketing research community at the 2007 Sawtooth Software Conference. Their work evidently made an impression, because Sawtooth Software came up with its own take on how to do cluster ensembles. Convergent Cluster Ensembles Analysis (CCEA), the result of this effort, combines an improved version of the old CCA program with an easy-to-use convergent version of cluster ensembles analysis.
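The basic intuition behind a cluster ensemble can be shown with a toy example: take several partitions of the same respondents and keep the groupings that most of them agree on. The sketch below uses a generic co-association ("how often do two respondents land in the same cluster?") approach with a simple majority threshold. It is not a description of how CCEA or the method presented in 2007 works internally; the function names, the threshold and the three input solutions are all hypothetical.

```python
def co_association(solutions):
    """Fraction of solutions in which each pair of respondents shares a cluster."""
    n = len(solutions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in solutions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(solutions) for c in row] for row in counts]

def consensus(solutions, threshold=0.5):
    """Link respondents that co-cluster in more than `threshold` of the solutions,
    then label each connected group as one consensus cluster."""
    co = co_association(solutions)
    n = len(co)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:  # depth-first walk over the "agrees often enough" links
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels

# Three hypothetical two-cluster solutions for the same six respondents.
# The second is the first with its labels swapped; the third moves respondent 2.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]
print(consensus(ensemble))  # [0, 0, 0, 1, 1, 1]
```

Because the comparison is made pair by pair, it does not matter that the input solutions number their clusters differently, which is what makes it practical to mix solutions from different programs.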

Today many more options exist for the analyst wanting to make clusters. R has any number of cluster analysis packages, latent class analysis allows some convergence testing (albeit at the cost of often lengthy run times) and even the general purpose statistics programs upped their game with many more options for running their cluster analyses. Still, for finding a robust solution, very easily and in a short amount of time, CCEA is impossible to beat. The fact that it allows custom user-specified ensembles enables a user to import her favorite solutions from R, or SPSS or latent class analysis, perhaps in combination with CCA solutions, and load all of them into a mega-ensemble analysis. An analyst running CCEA need not fear that he's failed to do his due diligence.

Keith Chrzan
Senior Vice President
Sawtooth Analytics