08-19-2013 07:44 AM
In a project at our university, we are trying to cluster binary data.
For this we have an Excel sheet in the following format:
C1 | C2 | C3 ...
0 | 1 | 0
0 | 0 | 1
1 | 1 | 0
With 1 = yes (it applies) and 0 = no (doesn't apply). Each line could be understood as a shopping basket where a product is bought or not.
Now we would like to get clusters à la: C1 and C2 are often paired; C3, C4, and C5 form one cluster; etc.
It's no problem to import the Excel sheet, set those variables to 'binary', and run a clustering with the Cluster node. The problem is that the results just don't make any sense: the clusters are based on the 0 and 1 values, not on the attributes. Are there any options in the Import or the Cluster node that have to be set in order for Enterprise Miner to interpret and cluster binary data meaningfully? We just can't find any.
We would much appreciate any help since this problem drives us crazy.
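To make the goal concrete, here is a minimal sketch of what "C1 and C2 are often paired" could mean numerically. It is written in plain Python rather than Enterprise Miner, uses only the three example rows from the table above, and assumes the Jaccard index as the pairing measure; that measure and the names `rows`, `names`, `jaccard` and `pair_similarity` are illustrative choices, not something taken from our project or from SAS.

```python
from itertools import combinations

# Toy baskets in the layout shown above: rows are baskets, columns are C1..C3.
rows = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]
names = ["C1", "C2", "C3"]


def jaccard(a, b):
    """Baskets containing both items divided by baskets containing either item."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0


# Transpose so that each entry is one column (one product) across all baskets.
columns = list(zip(*rows))

# Similarity for every pair of columns (assumed measure: Jaccard index).
pair_similarity = {
    (names[i], names[j]): jaccard(columns[i], columns[j])
    for i, j in combinations(range(len(names)), 2)
}

# List the pairs from most to least often bought together.
for pair, sim in sorted(pair_similarity.items(), key=lambda kv: -kv[1]):
    print(pair, round(sim, 2))
```

The point of the sketch is that the comparison runs over columns, not rows: a high value for a pair of columns means the two products tend to appear in the same baskets, which is the kind of grouping we are after.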
08-20-2013 02:27 PM
Thanks for posting. I'm sure many community members can relate to a problem that drives them crazy. I'm looking into it here at SAS and we will respond more specifically soon.
08-20-2013 04:10 PM
It appears you want to cluster variables and not observations. In that case, you can use the Variable Clustering node, or
factor analysis (see PROC FACTOR) or principal components. If you want to cluster rows, then for binary data, the
Euclidean distance measure used by K-Means is equivalent to counting the number of variables on which two cases disagree (the squared distance equals that count). However, you can try one of the