How to define/ensure a Minimum Cluster Size

mad_scientist · Posted 06-17-2018 11:14 PM

My intention is to perform clustering in SAS Enterprise Miner 14.2 on three continuous variables, and use the resulting clusters as customer segments. The Enterprise Miner Cluster node has plenty of criteria around the number of clusters, but I'd like to set a minimum number of observations per cluster, because I don’t want an outlying customer having their own segment. Let’s say the minimum number of observations per cluster should be 10.

The documentation doesn't seem to describe how to place a hard constraint on minimum observations per cluster. My first thoughts are:

1 - Write SAS code to recode the cluster result for the clusters with few observations. I guess I’d have to calculate the Euclidean distance between each of the relevant observations and each cluster centre and recode on that basis, OR
2 - Transform the data prior to clustering to bring in the outliers and thus lessen the likelihood of them forming their own clusters.
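
For illustration only, the first idea (recoding small clusters by Euclidean distance to the cluster centres) can be sketched in a few lines. This is plain Python rather than SAS code or anything built into the Cluster node, and the function name `merge_small_clusters` and its `min_size` parameter are hypothetical. It assumes the cluster labels have already been exported alongside the three input variables, and it computes centres from the large clusters only, without recomputing them after the recode.

```python
import math
from collections import Counter

def merge_small_clusters(points, labels, min_size=10):
    """Move members of clusters smaller than min_size to the nearest large cluster."""
    sizes = Counter(labels)
    large = [c for c, n in sizes.items() if n >= min_size]
    if not large:
        raise ValueError("no cluster meets the minimum size")

    # Centres of the large clusters only (mean of each coordinate).
    centres = {}
    for c in large:
        members = [p for p, l in zip(points, labels) if l == c]
        centres[c] = tuple(sum(col) / len(members) for col in zip(*members))

    new_labels = []
    for p, l in zip(points, labels):
        if sizes[l] >= min_size:
            new_labels.append(l)  # already in a large enough cluster
        else:
            # Recode to the large cluster whose centre is closest.
            new_labels.append(min(large, key=lambda c: math.dist(p, centres[c])))
    return new_labels
```

Whether the recode should use raw or standardized variables is a separate choice; the distances here are only meaningful if the points are on the same scale that was used for clustering.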

Any better ways? Thanks.

DougWielenga · Posted 11-21-2018 01:47 PM

mad_scientist wrote: "My intention is to perform clustering in SAS Enterprise Miner 14.2 on three continuous variables, and use the resulting clusters as customer segments. The Enterprise Miner Cluster node has plenty of criteria around the number of clusters, but I'd like to set a minimum number of observations per cluster, because I don’t want an outlying customer having their own segment."

What you are encountering is a common problem where a cluster analysis yields one large "blob" and perhaps several outlying clusters that have only a few observations each. These clusters might represent emerging opportunities that do not warrant individual treatment yet, or they might just be unusual or inaccurate data. If your goal is to better identify non-trivial clusters, you might consider one of the approaches I suggest below, but please understand that there is no 'right' answer to this -- it really depends on your business problem and/or the objectives of the analysis. Also note that these represent my personal opinion as an analyst, and other analysts might feel very differently.

1 - Consider reducing the number of variables: Given that you only have three continuous variables, this is not likely to help as much, but consider plotting each pair of the three variables against one another to get a good look at the patterns that exist in two dimensions. You might conclude you can bin one of the three variables and then cluster the observations on the other two variables within each bin of the binned variable. Alternatively, consider looking at a 3-dimensional plot to really visualize what is happening. Either way, you will likely see a way to improve your overall clustering since there are so few dimensions.

2 - Consider transforming your input variables, but be cautious: The problem with transformations is that they sometimes complicate the interpretation. For example, it's easier to think in terms of DOLLARS than LOG(DOLLARS). Since clustering is about interpretation, it can be problematic once you transform. Another approach would be to bin your observations based on their univariate distributions (into groups of values that are relatively close together, not on quantiles or fixed widths) and then create cluster profiles based on the bin each observation falls into for each variable. For example, suppose you have 3 bins for RECENCY, 5 bins for FREQUENCY, and 4 bins for AMOUNT. Each observation could then be classified by a 3-digit code that holds its 'profile' for the three variables. The composite profile '251' would represent observations in bin 2 on RECENCY, bin 5 on FREQUENCY, and bin 1 on AMOUNT. You will likely find that the majority of your observations fall into a small set of the overall 'profiles', which can lead to useful interpretation.
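
To make the composite profile concrete, here is a small Python sketch (not Enterprise Miner functionality). The cut points in `CUTS` are made-up placeholders; in practice they would come from looking at where each variable's values naturally group together. The names `CUTS`, `profile`, and `profile_counts` are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical cut points, chosen only for illustration.
CUTS = {
    "recency": [30, 90],           # 2 cuts -> 3 bins
    "frequency": [2, 5, 10, 20],   # 4 cuts -> 5 bins
    "amount": [50, 200, 1000],     # 3 cuts -> 4 bins
}
ORDER = ("recency", "frequency", "amount")

def profile(obs):
    """Return the 3-digit profile code: one 1-based bin number per variable."""
    return "".join(str(bisect_right(CUTS[v], obs[v]) + 1) for v in ORDER)

def profile_counts(observations):
    """Count how many observations fall into each composite profile."""
    return Counter(profile(o) for o in observations)

customer = {"recency": 45, "frequency": 25, "amount": 20}
print(profile(customer))  # prints 251
```

Sorting the output of `profile_counts` by count shows quickly whether most customers sit in a handful of profiles, and which profiles are too sparse to treat as segments on their own.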

3 - Consider alternate analysis methods: Multi-dimensional scaling (MDS) attempts to take a higher number of dimensions and represent it in a two-dimensional plot. You might see some interesting patterns emerge that help you see the structure in the data (or see that there isn't any!). You might also consider building some principal components and then clustering on the principal component scores (which is similar to what MDS does). Of course, there is no guarantee of how interpretable the PCs will be, but you might find it a useful approach to finding structure in the data, particularly if the variables are not orthogonal, since the PCs are orthogonal by construction.

4 - Consider an alternate and (potentially) more limited number of clusters: You only have three dimensions, which could create more than three clusters but perhaps not as many clusters as you are looking to consider. By default, SAS Enterprise Miner will create a set of cluster seeds and use those to create the initial cluster solution. It then will cluster those cluster seeds until a smaller number of clusters is created. Starting with fewer cluster seeds should reduce the outlier effect you described.

5 - Consider removing the outlier clusters and re-clustering the 'blob': Admittedly, I would be lambasted by some for suggesting we "throw away data", but I'm not saying to do so permanently. Remove the outliers that are forming their own clusters and re-cluster the 'blob' in the middle. You will likely encounter more clusters of outliers, so you might need to do this more than once. At some point, you will have a cluster solution for the 'central' observations and you can then score the entire data set -- thereby assigning the outliers ignored earlier to one of the 'central' clusters. Since our goal is to get non-trivial sized clusters, this might be the only way to do it.
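
A rough sketch of that remove, re-cluster, and score loop follows. It is written in plain Python with a bare-bones k-means so it stays self-contained. It is not the algorithm the Enterprise Miner Cluster node uses, and the names `kmeans`, `assign`, and `trim_and_recluster`, the farthest-point seeding, and the `max_rounds` cap are all assumptions made for the example.

```python
import math
from collections import Counter

def assign(p, centres):
    """Index of the centre nearest to p (Euclidean distance)."""
    return min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point seeding."""
    centres = [tuple(points[0])]
    while len(centres) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centres))
        centres.append(tuple(far))
    for _ in range(iters):
        labels = [assign(p, centres) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centres

def trim_and_recluster(points, k, min_size=10, max_rounds=10):
    """Set aside undersized clusters, re-cluster the rest, then score everything."""
    core = list(points)
    for _ in range(max_rounds):
        centres = kmeans(core, k)
        labels = [assign(p, centres) for p in core]
        sizes = Counter(labels)
        small = {j for j in range(len(centres)) if sizes[j] < min_size}
        if not small:
            break
        core = [p for p, l in zip(core, labels) if l not in small]
        if not core:
            raise ValueError("every cluster fell below the minimum size")
    # Keep only centres backed by enough observations, then score the full data set,
    # including the outliers that were set aside along the way.
    centres = [c for j, c in enumerate(centres) if sizes[j] >= min_size]
    if not centres:
        raise ValueError("no cluster meets the minimum size")
    return centres, [assign(p, centres) for p in points]
```

Note that scoring the set-aside outliers back in does not move the centres, so the segment definitions come from the 'central' observations only, which is the point of the approach.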

Again, just thoughts on what I might do but there are likely varying (and perhaps strongly different) opinions on any of these approaches.

Hope this helps!

Doug
