Clustering binary data with Enterprise Miner

08-19-2013 07:44 AM

Dear community,

In a project at our university we are trying to cluster binary data.

For this we have an Excel sheet in the following format:

C1 | C2 | C3 ...
0 | 1 | 0
0 | 0 | 1
1 | 1 | 0

With 1 = yes (it applies) and 0 = no (doesn't apply). Each line could be understood as a shopping basket where a product is bought or not.

Now we would like to obtain clusters along the lines of: C1 and C2 are often paired; C3, C4 and C5 form one cluster; and so on.

It's no problem to import the Excel sheet, set those variables to 'binary' and run a clustering with the Cluster node. The problem is that the results just don't make any sense. The clusters are based on the 0 and 1 values, not on the attributes. Are there any options in the Import or the Cluster node that have to be set in order for Enterprise Miner to interpret and cluster binary data meaningfully? We just can't find any.

We would much appreciate any help since this problem drives us crazy.

Best regards,

Sonnfan.

08-20-2013 02:27 PM

Hi Sonnfan,

Thanks for posting. I'm sure many community members can relate to a problem that drives them crazy. I'm looking into it here at SAS and we will respond more specifically soon.

Anna

08-20-2013 04:10 PM

Hi Sonnfan,

It appears you want to cluster variables and not observations. In that case, you can use the Variable Clustering node, or factor analysis (see PROC FACTOR) or principal components.

If you want to cluster rows, then for binary data the Euclidean distance measure used by K-Means amounts to counting the number of variables on which two cases disagree (the squared distance equals that count). However, you can try one of the following approaches:

- Run PROC DISTANCE, selecting the distance type that you want, and apply clustering using that distance matrix.

OR

- Project the binary variables and then cluster, as follows:
  - run factor analysis or PCA on the binary variables
  - save the factor or component scores as new variables
  - cluster on the basis of those scores (in that case, the data will no longer be binary)
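
To make the distance-matrix idea concrete, here is a small sketch in plain Python rather than SAS. It is an illustration only, not what PROC DISTANCE or the Cluster node does internally. The toy baskets, the choice of Jaccard distance, the single-linkage grouping and the 0.5 threshold are all assumptions made for the example. It first checks that squared Euclidean distance between two binary rows equals their number of disagreements, then groups the variables (columns) by how often they occur together.

```python
from itertools import combinations

# Hypothetical toy data: each row is a basket, each column a product C1..C5.
baskets = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 1],
]
names = ["C1", "C2", "C3", "C4", "C5"]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def disagreements(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(a, b):
    """1 - (both ones) / (at least one one); joint zeros are ignored."""
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return 1.0 - both / either if either else 0.0


def cluster_variables(columns, names, threshold):
    """Single-linkage grouping: merge two clusters whenever any pair of
    their members is within `threshold`, until no merge is possible."""
    clusters = [{i} for i in range(len(columns))]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            if any(
                jaccard_distance(columns[i], columns[j]) <= threshold
                for i in clusters[a]
                for j in clusters[b]
            ):
                clusters[a] |= clusters[b]
                del clusters[b]
                merged = True
                break  # cluster list changed, restart the scan
    return [sorted(names[i] for i in c) for c in clusters]


# For 0/1 data the two row distances coincide.
assert squared_euclidean(baskets[0], baskets[2]) == disagreements(baskets[0], baskets[2])

# Transpose so that each product becomes one vector over all baskets.
columns = [list(col) for col in zip(*baskets)]
print(cluster_variables(columns, names, threshold=0.5))
```

Jaccard distance is used here because it ignores baskets in which neither product was bought, which is often what you want for sparse purchase data; any other binary distance could be substituted in the same place.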

Good luck