Categorical inputs and standardisation in Cluster Analysis

pvareschi — Mon, 11 May 2020 09:10:35 GMT

Re: Applied Analytics Using SAS Enterprise Miner

I have a couple of questions on Cluster Analysis (chapter 8 of course notes):

1. In what scenarios should categorical variables, via dummy indicators, be used for clustering? Or would it be better to use only interval variables, as suggested by the course notes on page 8-9? ("An interval measurement level is recommended for k-means to produce non-trivial clusters")
2. In what instances would a Range Standardisation (with reference to property "Internal Standardization") be recommended in place of the usual standardisation (i.e. subtracting the mean and dividing by the standard deviation)?

Re: Categorical inputs and standardisation in Cluster Analysis

gcjfernandez — Tue, 12 May 2020 15:47:40 GMT

I have a couple of questions on Cluster Analysis (chapter 8 of course notes):

My Answers:

For k-means and hierarchical clustering, interval variables are recommended. The SAS HP Cluster node can also perform ABC clustering based on Manhattan distance. With this option you can also include dummy variables derived from a categorical variable.
2. In what instances would a Range Standardisation (with reference to property "Internal Standardization") be recommended in place of the usual standardisation (i.e. subtracting the mean and dividing by the standard deviation)?
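
To make the dummy-variable point concrete, here is a minimal Python sketch (not SAS code, and not how the HP Cluster node is implemented). The helper names `one_hot` and `manhattan`, the colour levels, and the data values are all made up for illustration. It shows that under Manhattan distance a mismatch on one dummy-coded categorical variable adds 2 to the distance, which can easily outweigh differences on scaled interval inputs.

```python
def one_hot(value, levels):
    """Encode one categorical value as 0/1 dummy indicators (hypothetical helper)."""
    return [1.0 if value == level else 0.0 for level in levels]


def manhattan(a, b):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


levels = ["red", "green", "blue"]  # illustrative category levels

# Each row: one interval input already scaled to [0, 1], followed by the dummies.
row_a = [0.2] + one_hot("red", levels)
row_b = [0.5] + one_hot("blue", levels)  # different category from row_a
row_c = [0.5] + one_hot("red", levels)   # same category as row_a

d_ab = manhattan(row_a, row_b)  # 0.3 from the interval input + 2 from the dummies
d_ac = manhattan(row_a, row_c)  # 0.3 from the interval input only

print(d_ab, d_ac)
```

In this toy example the category mismatch dominates the distance, which is one reason to think about how dummy indicators are weighted before mixing them with interval inputs.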

My answer:

For k-means clustering and PCA, z-standardization is preferred. For some special NN machine learning algorithms, range normalization may be preferred.
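
As a rough illustration of the two options, the Python sketch below compares them on made-up data. It assumes that range standardization means (x - min) / (max - min) and that z-standardization uses the sample standard deviation; the exact definition behind the Enterprise Miner "Internal Standardization" property may differ, so check the product documentation. The function names and the `incomes` values are hypothetical.

```python
from statistics import mean, stdev


def z_standardize(xs):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]


def range_standardize(xs):
    """Rescale to [0, 1]: subtract the minimum and divide by the range (assumed definition)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


incomes = [20.0, 30.0, 40.0, 50.0, 60.0]  # made-up interval input

z = z_standardize(incomes)      # mean 0, sample standard deviation 1
r = range_standardize(incomes)  # bounded between 0 and 1

print(z)
print(r)
```

Z-standardized values are unbounded but have unit variance, which suits variance-based methods such as k-means and PCA. Range-standardized values are bounded, which some algorithms expect of their inputs, but the scaling is set entirely by the two most extreme observations.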

topic Categorical inputs and standardisation in Cluster Analysis in SAS Academy for Data Science
