Solved: Re: Categorical inputs and standardisation in Cluster Analysis

pvareschi · Posted 05-11-2020 05:10 AM

Re: Applied Analytics Using SAS Enterprise Miner

I have a couple of questions on Cluster Analysis (chapter 8 of course notes):

1. In what scenarios should categorical variables, via dummy indicators, be used for clustering? Or would it be better to use only interval variables, as suggested by the course notes on page 8-9? ("An interval measurement level is recommended for k-means to produce non-trivial clusters")
2. In what instances would Range Standardisation (with reference to the property "Internal Standardization") be recommended in place of the usual standardisation (i.e. subtracting the mean and dividing by the standard deviation)?

gcjfernandez · Posted 05-12-2020 11:47 AM

1. In what scenarios should categorical variables, via dummy indicators, be used for clustering? Or would it be better to use only interval variables, as suggested by the course notes on page 8-9? ("An interval measurement level is recommended for k-means to produce non-trivial clusters")

My answer:

For k-means and hierarchical clustering, interval variables are recommended. The SAS HP Cluster node can also perform ABC clustering based on Manhattan distance. With this option you can also include dummy variables created from a categorical variable.
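To make the dummy-variable idea concrete, here is a minimal sketch in plain Python of how dummy indicators enter a Manhattan distance. It is not how the HP Cluster node is implemented. The variable names, the category levels, and the toy records are all hypothetical.

```python
def one_hot(value, levels):
    """Return a list of 0/1 dummy indicators, one per category level."""
    return [1 if value == level else 0 for level in levels]


def manhattan(a, b):
    """Manhattan (city-block) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical data: (income in thousands, favourite colour)
levels = ["red", "green", "blue"]
records = [(30.0, "red"), (32.0, "red"), (31.0, "blue")]

# Interval input followed by the dummy indicators for the categorical input
encoded = [[income] + one_hot(colour, levels) for income, colour in records]

d_same = manhattan(encoded[0], encoded[1])  # same colour, incomes differ by 2
d_diff = manhattan(encoded[0], encoded[2])  # different colour, incomes differ by 1

print(d_same, d_diff)
```

In this sketch a category mismatch always adds exactly 2 to the distance (one indicator switches off, another switches on), whatever the levels are. How much weight that carries depends on the scale of the interval inputs, which is one reason standardisation matters when mixing the two kinds of variable.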

2. In what instances would Range Standardisation (with reference to the property "Internal Standardization") be recommended in place of the usual standardisation (i.e. subtracting the mean and dividing by the standard deviation)?

My answer:

For k-means clustering and PCA, z-standardization is preferred. For certain neural network (NN) machine learning algorithms, range normalization may be preferred.
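For reference, the two transformations being compared can be sketched in plain Python as follows. This assumes the common definitions: z-standardization subtracts the mean and divides by the sample standard deviation, and range standardization subtracts the minimum and divides by the range. The exact formula used by the "Internal Standardization" property should be checked against the SAS documentation. The function names and the data are hypothetical.

```python
from statistics import mean, stdev


def z_standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - m) / s for v in values]


def range_standardize(values):
    """Subtract the minimum and divide by the range, giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical interval input
income = [20.0, 30.0, 40.0, 50.0, 60.0]

z = z_standardize(income)      # mean 0, standard deviation 1
r = range_standardize(income)  # bounded between 0 and 1

print(z)
print(r)
```

Both put inputs on a comparable scale so that no single variable dominates a distance calculation. The difference is that z-scores are unbounded with unit variance, while range-standardized values are confined to [0, 1] and are determined entirely by the minimum and maximum, so a single extreme value can compress the rest of the data.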
