08-27-2015 12:25 PM
I have a large, sparse data set and would like to segment my customers. To give you an idea, I have more than 100 variables and 2.2 million rows. The breakdown of my variables is as follows:
Since my data is sparse, I would like to use a density-based approach to clustering, and I expect the clusters to have different shapes. A bit of research suggested PROC MODECLUS, but given the sparsity of the data, I believe I first need PROC DISTANCE to compute a distance measure. The data covers the products and services that each customer receives; a missing value means the customer did not receive that product or service. I would like a clustering that goes beyond simply having a product or not (that is, I don't want 18 clusters, one for each of the 18 products).
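To make the intended workflow concrete, here is a minimal sketch of the PROC DISTANCE to PROC MODECLUS pipeline described above. It is not the code that was actually run. The data set and variable names (customers, customer_id, prod1-prod18), the sample size, the choice of Jaccard dissimilarity, and K=50 are all hypothetical placeholders. The sketch works on a random sample because a full pairwise distance matrix for 2.2 million rows would not fit in practice.

```sas
/* Hypothetical names throughout: customers, customer_id, prod1-prod18. */

/* Draw a simple random sample; an n x n distance matrix is only
   feasible for a modest n. The sample size is an arbitrary choice. */
proc surveyselect data=customers out=cust_sample
                  method=srs n=5000 seed=12345;
run;

/* Recode "missing = did not receive" into explicit 0/1 flags.
   Assumes prod1-prod18 are numeric. */
data cust_flags;
   set cust_sample;
   array prod{*} prod1-prod18;
   array has{18} has1-has18;
   do i = 1 to dim(prod);
      has{i} = not missing(prod{i});
   end;
   drop i;
run;

/* Jaccard dissimilarity on asymmetric nominal flags: two customers who
   both lack a product are not counted as similar on that product.
   SHAPE=SQUARE is requested as a conservative assumption about what
   the next step expects. */
proc distance data=cust_flags method=djaccard absent=0
              shape=square out=cust_dist;
   var anominal(has1-has18);
   id customer_id;
run;

/* Density-based clustering on the distance matrix. K= sets the number
   of neighbors used for density estimation; 50 is a placeholder. */
proc modeclus data=cust_dist method=1 k=50 out=cust_clusters;
   id customer_id;
run;
```

With only presence/absence flags as input, the distances can take relatively few distinct values, so the remaining interval-scaled variables would probably need to enter the distance as well (for example through METHOD=GOWER with mixed variable levels), which is exactly where the choice of options becomes unclear.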
So my question is: under these circumstances, which options should I choose in PROC DISTANCE to get a good clustering from PROC MODECLUS? I have tried the NOSTD option with MISSING specified for the variables, but it didn't give me anything credible.
Thanks a lot in advance.