Dear community,
in a project at our university we are trying to cluster binary data. For this we have an Excel sheet in the following format:
C1 | C2 | C3 ...
0 | 1 | 0
0 | 0 | 1
1 | 1 | 0
Here 1 = yes (it applies) and 0 = no (it doesn't apply). Each line can be understood as a shopping basket in which each product is either bought or not.
Now we would like to obtain clusters along the lines of "C1 and C2 are often paired", "C3, C4, and C5 form one cluster", and so on.
It's no problem to import the Excel sheet, set those variables to 'binary', and run a clustering with the Cluster node. The problem is that the results just don't make any sense: the clusters are based on the 0s and 1s, not on the attributes. Are there any options in the Import or Cluster node that have to be set so that Enterprise Miner interprets and clusters binary data meaningfully? We just can't find any.
We would much appreciate any help since this problem drives us crazy.
Best regards,
Sonnfan.
Hi Sonnfan,
It appears you want to cluster variables rather than observations. In that case, you can use the Variable Clustering node, factor analysis (see PROC FACTOR), or principal components. If you want to cluster rows, then note that for binary data the Euclidean distance measure used by k-means is equivalent to counting the number of variables on which two cases disagree (for example, the rows 0 1 0 and 1 1 1 differ on two variables, so their squared Euclidean distance is 2). However, you can also try other approaches for the row-clustering case.
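If the variable-clustering route fits your goal, here is a minimal SAS sketch, assuming a data set named work.baskets that holds binary flags C1-C10 (both the data set name and the variable names are hypothetical placeholders for your own):

proc varclus data=work.baskets maxclusters=5;  /* groups correlated flags into clusters of variables */
   var C1-C10;
run;

proc factor data=work.baskets method=principal rotate=varimax nfactors=3;  /* factor-analysis alternative */
   var C1-C10;
run;

PROC VARCLUS reports which variables land in the same cluster, which maps directly onto questions like "are C1 and C2 often paired?".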
Good luck
Hi Sonnfan,
Thanks for posting. I'm sure many community members can relate to a problem that drives them crazy. I'm looking into it here at SAS and we will respond more specifically soon.
Anna
Clustering attempts to create groups (or clusters) out of observational data which has no inherent groups. In many cases, analysts produce one cluster solution but don't take into account that a solution formed on a large set of variables is often driven by a small subset of those variables. For example, if I have already clustered automobiles based on engine size, highway miles per gallon, and weight, there is less variability left to explain by any subsequent variable such as average retail cost. If you then add categorical variables such as Country of Origin and Drivetrain to the input variables for clustering, you end up with categories split across clusters, making them less informative for understanding the structure of the data.
While clustering methods have been generalized to allow for categorical data, it is sometimes overlooked that the categorical data itself provides a natural grouping of the data. Sometimes a simple descriptive technique will be far more informative, given that all of your inputs are binary.
Instead of constructing your data
C1 | C2 | C3 ...
0 | 1 | 0
0 | 0 | 1
1 | 1 | 0
consider creating a string of characters where each position corresponds to a single binary variable. Assuming the same structure you supplied for variables C1-C3, this would create a new variable (let's call it "C") which can take on the values
000
001
010
100
011
101
110
111
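As a minimal sketch of that construction in a SAS DATA step, assuming the flags sit in a data set called work.baskets (the data set name is a placeholder):

data work.patterns;
   set work.baskets;
   length C $3;            /* one character per binary flag C1-C3 */
   C = cats(C1, C2, C3);   /* e.g. C1=0, C2=1, C3=0 becomes "010" */
run;

For more flags, extend the LENGTH and the CATS argument list accordingly.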
You can then do a simple frequency analysis of these "patterns" to see how commonly each occurs. The Pareto principle would suggest that a high percentage of the observations would be concentrated in a small percentage of the patterns. Even doing this for a large number of C(i) variables can yield great insights into subsets which occur frequently such as
1***111*****01***1
where the * can be 0 or 1. Sorting by frequency or subsetting by key substrings can provide great insights into your data without the complexity that clustering can add to data which is already categorized. Should you have a vast number of C(i) variables (e.g. C1 - C100 or more), you might consider grouping some of the C(i) variables together and looking at the patterns for those subsets.
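A sketch of that frequency analysis and substring subsetting, using the hypothetical work.patterns data set built above:

proc freq data=work.patterns order=freq;  /* most common patterns listed first */
   tables C / nocum;
run;

proc freq data=work.patterns order=freq;  /* only patterns whose first flag is 1 */
   where substr(C, 1, 1) = '1';
   tables C / nocum;
run;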
This approach also works when you have a large number of clustering variables. Suppose several of your key clustering variables are numeric and interval. Instead of cramming all of them into one big cluster solution, consider creating meaningful subsets of variables and clustering each subset separately. You can then build a character string as described above, where each digit corresponds to the cluster number for that dimension.
For a simple example, suppose you created three separate groups of variables, assigned to Recency, Frequency, and Monetary. Suppose also there are 3 clusters for Recency, 4 clusters for Frequency, and 2 clusters for Monetary. Then you could construct the strings in the same way, where
342 - represents cluster 3 on Recency, cluster 4 on Frequency, and cluster 2 on Monetary
132 - represents cluster 1 on Recency, cluster 3 on Frequency, and cluster 2 on Monetary
etc...
You will likely find the Pareto principle at work here as well, so this provides a meaningful way to combine multiple cluster solutions, each fit to a key subset of variables, and to see structure that would be lost in one big clustering solution.
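A rough sketch of that idea with PROC FASTCLUS, assuming a hypothetical data set work.customers sorted by a key variable id, with made-up Recency, Frequency, and Monetary inputs:

proc fastclus data=work.customers maxclusters=3
              out=rec(keep=id cluster rename=(cluster=rec_clus));
   var rec_days rec_visits;      /* hypothetical Recency variables */
run;

proc fastclus data=work.customers maxclusters=4
              out=frq(keep=id cluster rename=(cluster=frq_clus));
   var frq_orders frq_items;     /* hypothetical Frequency variables */
run;

proc fastclus data=work.customers maxclusters=2
              out=mon(keep=id cluster rename=(cluster=mon_clus));
   var mon_total mon_avg;        /* hypothetical Monetary variables */
run;

data work.rfm_patterns;
   merge rec frq mon;            /* all three outputs come from the same sorted input */
   by id;
   length rfm $3;
   rfm = cats(rec_clus, frq_clus, mon_clus);  /* e.g. "342" */
run;

proc freq data=work.rfm_patterns order=freq;
   tables rfm;
run;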
Hope this helps!
Doug
Thanks Doug, do you also standardize/normalize flag variables before using them in PROC FACTOR?
There are scenarios where you might consider standardizing/normalizing variables and scenarios where you might not. You just need to think about how the interpretations differ and decide which one makes more sense for your research question. It is easy enough to run the factor analysis both ways: once using the covariance matrix as the input (raw data) and once using the correlation matrix as the input (standardized/normalized data). The same question about whether to standardize/normalize comes up in Principal Components Analysis (PCA) and in predictive modeling. In the end, methods which attempt to explain as much variability as possible will be more influenced by variables with a larger amount of variability. It doesn't inherently make sense to weight a variable more heavily just because you altered its measurement units (e.g. from miles to inches), but neither does it make sense to normalize variables which all have (theoretically) the same scale (e.g. survey questions which measure strength of response).
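For example, a minimal way to run the factor analysis both ways (the data set work.survey and the variables q1-q10 are placeholders):

proc factor data=work.survey method=principal nfactors=2;      /* default: correlation matrix, i.e. standardized inputs */
   var q1-q10;
run;

proc factor data=work.survey method=principal nfactors=2 cov;  /* COV option: covariance matrix, i.e. raw scale */
   var q1-q10;
run;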
In the case of a survey where people are expressing their strength of agreement/satisfaction on some scale, it is natural that some questions will have greater variability than others and this could happen for a variety of reasons.
* some questions were poorly worded
* some areas were much more problematic than others
* the focus group itself has certain biases which might differ from the population
If my survey is trying to assess which factors are most important to the focus group, standardizing/normalizing the strength of agreement/satisfaction variables makes no sense because you are trying to identify which factors matter. Normalizing in this situation effectively weights every question equally, regardless of how little variability the question represented. On the other hand, if my goal is to create an overall metric that I plan to evaluate over time, failing to standardize makes the resulting scores less comparable, since each question potentially provides a different amount of influence on each solution. In many cases, the survey instrument itself may change in response to reviewing results where certain questions had very little variability.
If my data are not naturally on the same scale (e.g. cost of a car in dollars, horsepower, mileage), then standardizing typically makes more sense, but you also need to consider whether your data represent the full range of values you care about or only a narrow band of the population. If certain variables vary over only a tiny portion of the range seen in the population of interest while others span a large proportion of it, standardizing makes those narrowly varying variables much more important by giving them the same weight as variables whose data are far more representative of the whole population.
Neither approach is wrong; they just need to be interpreted differently, and considering which type of interpretation is of greater interest should help you decide whether or not to standardize and how to prepare your data as a result.
Hope this helps!
Doug