Clustering binary data with Enterprise Miner

N/A
Posts: 1

Dear community,

For a project at our university, we are trying to cluster binary data.

We have an Excel sheet in the following format:

C1 | C2 | C3 ...

0  |   1  |   0

0  |   0  |   1

1  |   1  |   0

With 1 = yes (it applies) and 0 = no (doesn't apply). Each line could be understood as a shopping basket where a product is bought or not.

Now we would like to obtain clusters along the lines of: C1 and C2 are often paired; C3, C4 and C5 form one cluster; and so on.

It's no problem to import the Excel sheet, set those variables to 'binary' and run a clustering with the Cluster node. The problem is that the results just don't make any sense: the clusters are based on the 0 and 1 values, not on the attributes. Are there any options in the Import or the Cluster node that have to be set so that Enterprise Miner interprets and clusters binary data meaningfully? We just can't find any.

We would much appreciate any help since this problem drives us crazy.

Best regards,

Sonnfan.

Community Manager
Posts: 485

Re: Clustering binary data with Enterprise Miner

Hi Sonnfan,

Thanks for posting. I'm sure many community members can relate to a problem that drives them crazy. I'm looking into it here at SAS and we will respond more specifically soon.

Anna

SAS Employee
Posts: 31

Re: Clustering binary data with Enterprise Miner

Hi Sonnfan,

It appears you want to cluster variables and not observations. In that case, you can use the Variable Clustering node,
factor analysis (see PROC FACTOR), or principal components. If you do want to cluster rows, then for binary data the
squared Euclidean distance minimized by K-Means is equal to the number of variables on which two cases disagree. However, you can try one of the
following approaches:

  1. Run PROC DISTANCE, selecting the distance type
    that you want, and apply clustering using that distance matrix.
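
To illustrate why the choice of distance type matters for 0/1 data, here is a minimal Python sketch (plain Python, not SAS syntax). The toy baskets and the function names are made up for illustration, and treating two empty baskets as distance 0 under Jaccard is my own convention. It compares the disagreement count (which for binary data equals the squared Euclidean distance) with the Jaccard distance, which ignores products that neither basket contains.

```python
from itertools import combinations

# Hypothetical toy data: rows are baskets, columns are products C1..C4.
baskets = {
    "b1": [1, 1, 0, 0],
    "b2": [1, 1, 0, 0],
    "b3": [0, 0, 1, 1],
    "b4": [0, 0, 0, 0],  # empty basket
}

def hamming(x, y):
    """Number of variables on which the two 0/1 vectors disagree."""
    return sum(a != b for a, b in zip(x, y))

def squared_euclidean(x, y):
    """Sum of squared differences; equals hamming() for 0/1 vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def jaccard_distance(x, y):
    """1 - (both are 1) / (at least one is 1); joint zeros are ignored."""
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    # Assumption: two all-zero vectors are treated as identical.
    return 0.0 if either == 0 else 1.0 - both / either

for (n1, x), (n2, y) in combinations(baskets.items(), 2):
    print(n1, n2, hamming(x, y), squared_euclidean(x, y),
          round(jaccard_distance(x, y), 2))
```

In this toy data, the disagreement count puts the empty basket b4 closer to b1 than b3 is, whereas Jaccard treats both as maximally distant from b1 because they share no purchased product. Whether that is the behaviour you want depends on whether a shared 0 should count as similarity.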

OR

  Project the binary variables and do
  clustering on the projection, as follows:

  1. Run factor analysis or PCA on the binary
    variables.
  2. Save the factor or component scores as new
    variables.
  3. Cluster on the basis of those scores (in that
    case, the data will no longer be binary).
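
Factor analysis and PCA both start from the correlations between columns, and that is also the kind of information a variable clustering uses. As a rough Python sketch of that first step (again not SAS syntax; the toy columns, the function names and the 0.5 threshold are all assumptions for illustration), the Pearson correlation between two 0/1 columns, also known as the phi coefficient, already shows which products tend to be bought together:

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: one list per product column, one entry per basket.
columns = {
    "C1": [1, 1, 1, 0, 0, 0],
    "C2": [1, 1, 0, 0, 0, 0],
    "C3": [0, 0, 0, 1, 1, 0],
    "C4": [0, 0, 0, 1, 1, 1],
}

def pearson(x, y):
    """Pearson correlation; for 0/1 columns this is the phi coefficient.

    Assumes neither column is constant (otherwise the variance is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def paired_columns(cols, threshold=0.5):
    """Column pairs whose correlation exceeds the (arbitrary) threshold."""
    return [
        (a, b)
        for a, b in combinations(cols, 2)
        if pearson(cols[a], cols[b]) > threshold
    ]

print(paired_columns(columns))
```

On this toy data, only C1/C2 and C3/C4 come out as positively correlated pairs; a real analysis would feed the full correlation matrix into the variable clustering or factor procedure rather than thresholding pairs by hand.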

Good luck
