07-01-2015 10:43 AM
I have a dataset with about 1000 variables (all numerical) and no target variable (it is unsupervised). It has a column "zipcode", and my goal is to form meaningful clusters from this dataset to analyze the associations between the zip codes. I was looking to reduce the number of variables (dimensionality reduction) so that I can pass the reduced dataset to PROC VARCLUS. Is there an effective procedure for dimensionality reduction on unsupervised datasets? I am using Enterprise Miner and Enterprise Guide. Any related response would be of great help. Thank you!
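For reference, a minimal PROC VARCLUS call of the kind described above might look like the sketch below; the dataset name mydata, the variable list x1-x1000, and the maxeigen=0.7 cutoff are all placeholder assumptions, not values from the post.

```sas
/* Hypothetical sketch: variable clustering for dimension reduction.
   MAXEIGEN=0.7 keeps splitting clusters while the second eigenvalue
   exceeds 0.7; SHORT suppresses the long printed output. */
proc varclus data=mydata maxeigen=0.7 short;
   var x1-x1000;   /* the ~1000 numeric inputs */
run;
```

One representative variable (or the cluster component score) can then be taken from each resulting cluster to form the reduced input set.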
07-01-2015 12:47 PM
1000 inputs do not seem like a lot, so I think you are fine to use the Cluster or HPCluster nodes directly on those inputs. I am not very clear on what you are planning to do with the zip codes. Were you planning to run a cluster node on your 1000 inputs and then compare those clusters to your zip codes? Or what was your plan?
You can use the Variable Cluster and Principal Component nodes in Enterprise Miner for dimension reduction, but I am not sure that you need it.
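Outside Enterprise Miner, the same principal-component idea is available in Base SAS via PROC PRINCOMP; a minimal sketch follows, where mydata, the variable list, and keeping n=50 components are illustrative assumptions.

```sas
/* Hypothetical sketch: principal component analysis for dimension
   reduction. N=50 keeps the first 50 components; OUT= writes the
   component scores (Prin1-Prin50) alongside the original rows. */
proc princomp data=mydata out=pcscores n=50;
   var x1-x1000;
run;
```

The pcscores output (Prin1, Prin2, ...) could then be fed to a clustering step in place of the original 1000 variables.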
07-01-2015 02:02 PM
Thank you for your reply. Yes, I was planning to run the cluster node on the 1000 inputs and then compare/map the observations to their respective zip codes. FYI, each observation is identified by a unique zip code. This is the only method I could think of. Is there any other efficient method or procedure for dimensionality reduction on an unsupervised dataset besides the Cluster node?
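The compare/map step described above could be done in code as well as in a node: a k-means run whose output data set carries the zipcode column gives one cluster assignment per zip code. The sketch below uses PROC FASTCLUS with an assumed dataset mydata and an assumed maxclusters=10.

```sas
/* Hypothetical sketch: k-means clustering with the zip code kept as
   an identifier. OUT= adds CLUSTER and DISTANCE columns to each row,
   so every zipcode is mapped to its assigned cluster. */
proc fastclus data=mydata maxclusters=10 out=clustered;
   id zipcode;
   var x1-x1000;
run;
```

The clustered output can then be summarized by cluster to see which zip codes group together.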