Clustering

NicolasC — Fri, 08 Sep 2017 10:35:08 GMT

I am using a combination of the Cluster node and the Segment Profile node on a customer dataset. I have nine interval variables and one nominal variable.

I have two questions:

1. I used to work in Python, where I used a one-hot encoder to transform my nominal variable(s) into dummy indicators, so that all my variables would be numeric and a k-means algorithm could be used. I am wondering how SAS Enterprise Miner deals with a combination of nominal and interval variables, and how trustworthy the resulting clusters/segments are.

2. My second question is more on the graphics side: when I check the Results from the Cluster node, the colors of my clusters in the Segment Plot window and the Segment Size window are different, which can be very confusing. It might be an easy change, but I cannot seem to sort this out myself.

Many Thanks

Nicolas

Re: Clustering

DougWielenga — Mon, 26 Feb 2018 20:08:34 GMT

I'll try and respond to each of your questions below:

I am wondering how SAS Enterprise Miner deals with a combination of nominal and interval variables, and how trustworthy the resulting clusters/segments are.

The only way to bring categorical data into numerical algorithms is to code the values. There is no right or wrong way to do that because there is no right distance between red and blue (for instance). If I code red as 0 and blue as 1, they are 1 unit apart, but that spacing comes from the coding, not from the data. To see what SAS Enterprise Miner is doing, open the application help by clicking on Help > Contents. Then navigate in the panel on the left to Node Reference > Explore > Cluster Node, and click on "Coding of the Class Variables in the Cluster Node" in the panel on the right, where you will see the following (excerpted -- see the application help for examples):

To incorporate the class variables into the analysis, the Cluster node codes the class variables as follows:

Binary — one dummy variable is created. It contains a value of 0 or 1.

Nominal — one dummy variable is created per level that contains a value of 0 or 1.

Ordinal — one dummy variable is created for each ordinal input. The smallest ordered value is mapped to 1, the next smallest ordered value is mapped to 2, and so on.
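For readers coming from Python, the three rules above can be sketched roughly as follows. This is only an illustration of the documented mapping, not SAS Enterprise Miner's actual implementation: the function names are made up for this post, the choice of which binary level gets the 1 is an assumption, and any scaling or standardization the node may apply is not shown. The nominal rule is what a one-hot encoder does.

```python
from math import dist


def code_binary(values):
    """One 0/1 dummy. Assumption: the second level in sorted order is coded 1."""
    levels = sorted(set(values))
    if len(levels) != 2:
        raise ValueError("a binary variable must have exactly two levels")
    return [int(v == levels[1]) for v in values]


def code_nominal(values):
    """One 0/1 dummy per level (one-hot encoding), levels in sorted order."""
    levels = sorted(set(values))
    return [[int(v == level) for level in levels] for v in values]


def code_ordinal(values, ordered_levels):
    """One numeric column: smallest level -> 1, next smallest -> 2, and so on."""
    rank = {level: i for i, level in enumerate(ordered_levels, start=1)}
    return [rank[v] for v in values]


colors = ["red", "blue", "green", "blue"]
dummies = code_nominal(colors)  # columns are blue, green, red

# With one dummy per level, any two different levels end up the same
# Euclidean distance apart (square root of 2); no level is "closer" to another.
red_blue_distance = dist(dummies[0], dummies[1])

flags = code_binary(["no", "yes", "no"])
sizes = code_ordinal(["low", "high", "medium"], ["low", "medium", "high"])
```

Note that the ordinal rule does impose a spacing (low to high is 2 units, low to medium is 1), whereas the nominal rule treats every pair of levels alike.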

Please note that this is a common way to code categorical variables. The coding does not impact how 'trustworthy' the segments are since there is no correct distance between red and blue (for example). Clustering was designed for numerical data but, like many such methods, it can be adapted for categorical data. If you only have one categorical variable, it might be better to consider building a separate cluster solution of your numerical variables for each level of your categorical variable. There is no right or wrong cluster solution -- just solutions that are more helpful or less helpful based on your analysis goals. The decision about the best approach is therefore a judgement call for the analyst.
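As a rough sketch of the "one cluster solution per level" idea, the snippet below splits the rows by the categorical variable and clusters only the numeric columns within each group. Everything here is an assumption for illustration: the data is made up, `kmeans` is a deliberately tiny k-means with a naive start (the first k points), and it stands in for whatever clustering routine you actually use. It is not how the Cluster node works internally.

```python
from collections import defaultdict
from math import dist


def kmeans(points, k, iterations=20):
    """Tiny Lloyd's-algorithm k-means; returns one cluster label per point."""
    # Naive deterministic start: the first k points are the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_per_level(rows, k):
    """rows: (level, numeric_vector) pairs. Runs k-means separately per level."""
    by_level = defaultdict(list)
    for level, vector in rows:
        by_level[level].append(vector)
    return {level: kmeans(vectors, k) for level, vectors in by_level.items()}


# Made-up data: one nominal variable (color) and two interval variables.
rows = [
    ("red", (1.0, 1.0)), ("red", (1.2, 0.9)), ("red", (8.0, 8.1)), ("red", (7.9, 8.3)),
    ("blue", (0.5, 9.0)), ("blue", (0.4, 9.2)), ("blue", (9.0, 0.5)), ("blue", (9.1, 0.7)),
]
solutions = cluster_per_level(rows, k=2)  # one list of labels per color
```

Cluster labels are only meaningful within a level: cluster 0 for red and cluster 0 for blue are unrelated segments.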

I personally don't like to use categorical variables in clustering since the clusters don't tend to neatly resolve categories, which can muddle interpretation. Instead of having a clean break of red and blue into different clusters, you might end up with 80% red and 20% blue in one cluster and 20% red and 80% blue in another. This does not mean that the solution isn't useful -- it just might be more difficult to interpret than two separate cluster solutions: one done for the blue observations and one done for the red observations.
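If you do keep the categorical variable in the clustering, one way to check how cleanly the clusters separate its levels is to tabulate the share of each level within each cluster. The sketch below uses made-up cluster assignments that reproduce the 80/20 mix described above; `cluster_composition` is a hypothetical helper name, not a SAS or Python library function.

```python
from collections import Counter, defaultdict


def cluster_composition(cluster_ids, categories):
    """Return {cluster: {category: share}}; shares within a cluster sum to 1."""
    counts = defaultdict(Counter)
    for cluster, category in zip(cluster_ids, categories):
        counts[cluster][category] += 1
    return {
        cluster: {cat: n / sum(c.values()) for cat, n in c.items()}
        for cluster, c in counts.items()
    }


# Made-up assignments: 10 observations in each of two clusters.
cluster_ids = [1] * 10 + [2] * 10
categories = ["red"] * 8 + ["blue"] * 2 + ["red"] * 2 + ["blue"] * 8
composition = cluster_composition(cluster_ids, categories)
```

Shares close to 0 or 1 mean the clusters resolve the levels cleanly; shares in between, as in this example, are the muddled case.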

The colors of my clusters in Segment Plot Window and Segment Size Window are different

The colors in the Segment Plot window correspond to ranges of values for the variable in question in each segment. You can click on one of the small squares in the legend to see the proportion of that range of values which appears in each segment. The colors in the Segment Size window correspond to the segment itself, not to the range of values of a variable within a particular segment. As a result, there is no expectation that the colors should coincide. The Segment Plot window is easier to understand if you click through the squares representing ranges of values so that you can see where those values are highlighted in the chart itself.

Hope this helps!

Doug
