pvareschi
Quartz | Level 8

Re: Applied Analytics Using SAS Enterprise Miner

I have a couple of questions on Cluster Analysis (chapter 8 of course notes):

1. In what scenarios should categorical variables, via dummy indicators, be used for Clustering? Or would it just be better to use interval variables as suggested by the course notes at page 8-9? ("An interval measurement level is recommended for k-means to produce non-trivial clusters")
2. In what instances would Range Standardisation (with reference to the property "Internal Standardization") be recommended in place of the usual standardisation (i.e. subtracting the mean and dividing by the standard deviation)?

 

1 ACCEPTED SOLUTION

gcjfernandez
SAS Employee

I have a couple of questions on Cluster Analysis (chapter 8 of course notes):

1. In what scenarios should categorical variables, via dummy indicators, be used for Clustering? Or would it just be better to use interval variables as suggested by the course notes at page 8-9? ("An interval measurement level is recommended for k-means to produce non-trivial clusters")

My answer:

For k-means and hierarchical clustering, interval variables are recommended. The SAS HP Cluster node can also perform ABC clustering based on Manhattan distance; with this option you can also include dummy variables derived from a categorical variable.
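
As a rough illustration of why dummy indicators pair naturally with Manhattan distance, here is a minimal Python sketch. It is not SAS code and does not reflect how the HP Cluster node is implemented; the helper names `dummy_encode` and `manhattan` and the colour levels are hypothetical, chosen only for this example.

```python
def dummy_encode(value, levels):
    """Return one 0/1 indicator per level of a categorical variable."""
    return [1 if value == level else 0 for level in levels]


def manhattan(a, b):
    """City-block distance: the sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical categorical variable with three levels.
levels = ["red", "green", "blue"]

red = dummy_encode("red", levels)      # [1, 0, 0]
green = dummy_encode("green", levels)  # [0, 1, 0]

same = manhattan(red, red)         # identical categories
different = manhattan(red, green)  # two indicators differ
```

On 0/1 indicators the Manhattan distance simply counts the indicators that disagree, so two observations in different categories are always the same distance apart regardless of which categories they are.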

2. In what instances would Range Standardisation (with reference to the property "Internal Standardization") be recommended in place of the usual standardisation (i.e. subtracting the mean and dividing by the standard deviation)?

My answer:

For k-means clustering and PCA, z-standardization is preferred. For some special NN machine learning algorithms, range normalization may be preferred.
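
To make the two options concrete, here is a minimal Python sketch of both transformations. It assumes that range standardization means subtracting the minimum and dividing by the range, and that z-standardization uses the sample standard deviation; the exact location shift used by the Enterprise Miner property may differ. The `income` values are made up for illustration.

```python
from statistics import mean, stdev


def z_standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def range_standardize(values):
    """Subtract the minimum and divide by the range, giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical interval variable with one extreme observation.
income = [20.0, 30.0, 40.0, 50.0, 1000.0]

z_scores = z_standardize(income)
range_scores = range_standardize(income)
```

Z-standardized values have mean 0 and standard deviation 1 but no fixed bounds, while range-standardized values always lie between 0 and 1. That bounded scale is one reason range normalization is sometimes chosen when an algorithm expects inputs in a fixed interval.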
