Categorical variables in CCEA


I would like to run a cluster analysis with both continuous data (age) and categorical data (5 income classes).

Is it OK to standardize all variables first and then run the CCEA analysis? I am not sure, since CCEA relies on Euclidean distances, which are not suitable for categorical data.

Many thanks for your help!
asked Apr 17, 2015 by anonymous

1 Answer

+2 votes
This is a tricky question. You are right that Euclidean distance is not ideal as a loss function when mixing continuous variables with categorical ones (dummy-coded). So, although you could dummy-code your categorical variables, submit them to CCEA alongside the continuous data, and then standardize everything, that approach has weaknesses.

I would think about using the Ensembles capability within CCEA (click the button to use Ensembles analysis instead of K-Means).

First, submit only your continuous variables to Ensembles analysis (and standardize them), so that the continuous variables are the only basis variables in that run. You will be given an ensemble file (in .csv format) showing a dozen or more potential segmentation solutions with 2 groups, 3 groups, 4 groups, etc.
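If you would rather standardize the continuous variables yourself before importing them, a minimal Python sketch of z-scoring follows. The `ages` values are made up, and dividing by the population standard deviation is an assumption; I am not claiming this is the exact formula CCEA applies internally.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation; the sample SD would also be defensible.
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant variable carries no information for clustering.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

ages = [23, 35, 47, 52, 68]  # hypothetical respondent ages
z_ages = standardize(ages)
```

After this step every continuous variable has mean 0 and standard deviation 1, so no single variable dominates the distance calculation just because of its scale.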

Then, dummy-code all your categorical variables (1 0 0; 0 1 0; and 0 0 1 for a 3-level categorical variable, etc.) and submit just your categorical variables to Ensembles analysis. Now you have a second .csv file that contains an ensemble of a dozen or more potential segmentation solutions (2-group, 3-group, 4-group, etc.).
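The dummy-coding described above can be sketched in a few lines of Python. The function name `dummy_code` and the example income classes are hypothetical; the only thing taken from the answer is the coding scheme itself (one 0/1 column per level, with a single 1 per respondent).

```python
def dummy_code(values, levels):
    """Return one row of 0/1 indicators per value, one column per level."""
    position = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)
        row[position[value]] = 1  # mark the level this respondent belongs to
        rows.append(row)
    return rows

# Hypothetical data: income class (1 to 5) for five respondents.
income_class = [1, 3, 5, 2, 3]
coded = dummy_code(income_class, levels=[1, 2, 3, 4, 5])
```

With five income classes, each respondent becomes a row of five columns containing exactly one 1, the 5-level analogue of the 1 0 0; 0 1 0; 0 0 1 pattern.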

Now, using Excel, simply combine those two ensemble files (the one from the continuous variables run with the one from the categorical variables run), so that you have a bigger .csv file with twice as many columns as either one of them (but the same number of rows, where each row is a respondent). Save that as your new master ensembles file, to use as input into your final ensembles consensus run.
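If you prefer a script to Excel for this step, here is a rough Python sketch of the side-by-side merge. It rests on assumptions that may not match your files: that each ensemble file has a header row, and that the first column is a respondent ID. The `cat_` prefix and the small example tables are made up, so check the layout of your actual files before relying on it.

```python
import csv

def read_csv(path):
    """Read a CSV file and return (header, data_rows)."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

def write_csv(path, header, rows):
    """Write a header and data rows to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

def combine_ensembles(head_a, rows_a, head_b, rows_b):
    """Join two ensembles side by side, matching rows on the ID in column 0."""
    b_by_id = {row[0]: row[1:] for row in rows_b}
    ids_a = [row[0] for row in rows_a]
    # Both files must describe exactly the same respondents, each once.
    if len(b_by_id) != len(rows_b) or sorted(ids_a) != sorted(b_by_id):
        raise ValueError("the two ensemble files do not cover the same respondents")
    # Prefix the second file's columns so that header names stay unique.
    header = head_a + ["cat_" + name for name in head_b[1:]]
    rows = [row + b_by_id[row[0]] for row in rows_a]
    return header, rows

# Tiny made-up ensembles: respondent ID plus two candidate solutions each.
head_cont = ["id", "sol1", "sol2"]
rows_cont = [["101", "1", "2"], ["102", "2", "3"], ["103", "1", "1"]]
head_cat = ["id", "sol1", "sol2"]
rows_cat = [["103", "2", "2"], ["101", "1", "3"], ["102", "1", "1"]]

header, rows = combine_ensembles(head_cont, rows_cont, head_cat, rows_cat)
```

For real files you would call `read_csv` on each ensemble file, pass the results to `combine_ensembles`, and save the output with `write_csv`. Matching on the ID rather than on row position guards against the two files being sorted differently.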

Within CCEA, tell it that you want to run Ensembles analysis using your big custom ensembles file.  It now finds a consensus solution leveraging the big ensemble file which contains dozens of segmentation solutions from those separate initial runs.
answered Apr 17, 2015 by Bryan Orme Platinum Sawtooth Software, Inc. (174,740 points)