munitech4u
Quartz | Level 8

I am working on a clustering problem. I realized that because the FASTCLUS procedure uses Euclidean distance, it might not be a very good method for my data, which contains both continuous and binary variables. So I created a distance matrix based on the Gower distance by using "PROC DISTANCE" and tried to feed that as input to "PROC FASTCLUS". But it produced only 1 cluster even though I specified 3 clusters.

However, PROC FASTCLUS works fine on the original data set that I fed to "PROC DISTANCE". How can I obtain the clusters from the distance matrix?

 

 

Rick_SAS
SAS Super FREQ

You might be confusing PROC FASTCLUS with PROC CLUSTER. With PROC CLUSTER you can input a distance matrix (specify that it is a TYPE=DISTANCE data set). PROC CLUSTER can use the distances to create a dendrogram (tree) that clusters the observations. An example is given in the PROC CLUSTER documentation.

 

PROC FASTCLUS does not do hierarchical clustering. The goal of PROC FASTCLUS is to find new points that are centers of the clusters. That requires knowing the coordinates of the observations themselves, rather than summarized data like a distance matrix.

 

When you input the distance matrix, PROC FASTCLUS thinks that you are giving it a new data set that has N observations and N variables. It interprets the distances as being coordinates.  It should report an ERROR if you input a TYPE=DISTANCE matrix. A Gower matrix has TYPE=SIMILAR, so PROC FASTCLUS displays a warning such as

WARNING: The DATA=WORK.DISTANCE.DATA data set is TYPE=SIMILAR, which is an unrecognized data set
         type for the DATA= option, but it will be treated as an ordinary SAS data set.

 

Here is an example that demonstrates the difference between finding the two clusters for four 1-D points (the first call to PROC FASTCLUS) and finding two clusters for four 4-D points (the second call to PROC FASTCLUS):

 

data A;
input x @@;
datalines;
0 1 5 6
;

/* find optimal centers for 1-D data at c1=0.5 and c2=5.5 */
proc fastclus data=A maxclusters=2;
var x;
run;

proc distance data=A out=Distance method=gower shape=square;
var interval(x);
run;

/* Find optimal centers for 4-D data. A WARNING message is 
   displayed because TYPE=SIMILAR data is not a recognized input type. */
proc fastclus data=Distance maxclusters=2;
run;

munitech4u
Quartz | Level 8
Thanks. But that does not solve my objective. How can I cluster the distance matrix output from proc distance?
Rick_SAS
SAS Super FREQ

As I mentioned in the first paragraph of my reply, you can use PROC CLUSTER to cluster the distance matrix.

munitech4u
Quartz | Level 8

I tried the proc cluster with method=density k=30. But it failed to produce any result.

Rick_SAS
SAS Super FREQ

Were there ERRORS or WARNINGS in the log?  Without more details or access to your data, there's not a lot we can suggest.  The following statements extract 50 obs from the sashelp.iris data set. Calling PROC DISTANCE creates a 50x50 distance matrix. Then PROC CLUSTER uses the distance matrix to form a hierarchical cluster model.  Perhaps you can modify this code to make yours work, or modify this code so that it "fails to produce any result."

 

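data Iris;
length ID $3;
set Sashelp.Iris ;
ID = put(_N_, z3.);   /* explicit numeric-to-character conversion: 001, 002, ... */
if _N_ <= 50;         /* keep the first 50 observations */
run;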

/* METHOD=DGOWER is 1 minus the Gower similarity, so OUT= holds dissimilarities */
proc distance data=Iris out=Distance method=dgower shape=square;
var interval(SepalWidth SepalLength PetalWidth PetalLength);
id ID;
run;

proc cluster data=Distance(type=distance) method=density k=30;
id ID;
run;
munitech4u
Quartz | Level 8

It says that it failed to allocate memory.

I have a 27000x27000 distance matrix. Why should it fail? In the real world the matrix can be really huge, depending on the number of customers. In big industries, the number of customers can range from a few million to a billion.

 

Is there a way to get good results in reasonable time with proc cluster? Or some divisive method to overcome this problem?

Rick_SAS
SAS Super FREQ

You can compute how many gigabytes it takes to hold a matrix in memory. A 27000 x 27000 matrix of 8-byte numeric values requires about 5.43 GB of RAM (27000 x 27000 x 8 bytes) just to store the distance matrix, plus additional memory to compute the clusters. The most likely explanation is that you are not allocating sufficient RAM to your SAS process. I don't know how much you would need, but try using the -MEMSIZE system option to allocate 16G; see the directions for setting the -MEMSIZE option.
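
The arithmetic behind that estimate can be checked with a short DATA step. This is only a sketch: it assumes 8 bytes per stored value and ignores any workspace that PROC CLUSTER needs on top of the matrix. The variable names are made up for illustration. The square-matrix value should agree with the 5.43 figure above.

```sas
/* Sketch: memory needed to hold an n x n matrix of 8-byte numeric values */
data _null_;
   n = 27000;                            /* number of observations in the thread */
   bytes_square = n * n * 8;             /* full square matrix */
   bytes_lower  = n * (n + 1) / 2 * 8;   /* lower triangle plus diagonal */
   gb_square = bytes_square / 1024**3;   /* convert bytes to GB (2**30 bytes) */
   gb_lower  = bytes_lower  / 1024**3;
   put gb_square= 8.2 gb_lower= 8.2;     /* write both values to the log */
run;
```

Changing n shows how quickly the requirement grows: doubling the number of observations quadruples the storage.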

 

Whether your computation will finish in a reasonable time is another matter, and I have no experience with problems of that size. Obviously (from the name) the FASTCLUS procedure is much faster.   If you know that you are looking for a small number of clusters (maybe 50 or 100), it might be faster to run PROC FASTCLUS several times with varying MAXCLUSTERS= values.
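
One way to run PROC FASTCLUS for several values of MAXCLUSTERS= is a small macro loop. The following is a rough sketch, not tested on data of your size. The macro name TryK, its parameters, and the output data set names Clus&k and Stat&k are hypothetical, and it assumes that all of the analysis variables are interval variables.

```sas
/* Sketch: run PROC FASTCLUS once for each k in [kmin, kmax] */
%macro TryK(data=, vars=, kmin=2, kmax=5);
   %do k = &kmin %to &kmax;
      /* OUT= holds the cluster assignments; OUTSTAT= holds summary statistics */
      proc fastclus data=&data maxclusters=&k
                    out=Clus&k outstat=Stat&k noprint;
         var &vars;
      run;
   %end;
%mend TryK;

/* Example call on the iris measurements */
%TryK(data=Sashelp.Iris,
      vars=SepalLength SepalWidth PetalLength PetalWidth,
      kmin=2, kmax=5);
```

You can then compare the Stat2, Stat3, ... data sets to decide which number of clusters is reasonable.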

 

By the way, I just ran my example without the SHAPE=SQUARE option, and it works, so if your distance matrix is symmetric you might be able to save some memory by storing only half the matrix.

munitech4u
Quartz | Level 8

OK, I obtained the results. But I am having difficulty understanding the output. I think it formed as many clusters as there are rows in the matrix. How can I form only 3 clusters out of them? Does it make sense to rank _DENSE_ into 3 bins to get 3 clusters?

Rick_SAS
SAS Super FREQ

You sound surprised, but a hierarchical tree partitions the observations into 1, 2, 3, 4,...., 27,000 clusters, which is one reason that PROC CLUSTER is slower than FASTCLUS, which only partitions the obs into k clusters for a particular choice of k.  The distance or density variables enable you to prune the tree at a particular level, as shown in the CLUSTER documentation.  You should study the doc closely to get a handle on these concepts.

 

You can use PROC TREE to prune the tree to three clusters.  First use the OUTTREE=Tree option in the PROC CLUSTER statement to write the tree to a SAS data set.  Then use the following (untested) code to visualize:

 

proc tree data=tree noprint out=Prune3 ncl=3; /* prune to 3 clusters */
   /* optionally use HEIGHT stmt to specify the height var */
   copy x1 x2 x3 ...; /* list important variables here */
run;

title "Plot of 3 Clusters";
proc sgscatter data=Prune3;
   matrix x1 x2 x3 ... / group=cluster; /* impt vars */
run;
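
For the 50-observation iris example earlier in this thread, the two steps might be put together roughly as follows. This is a sketch that assumes the Distance data set and the ID variable from that example already exist. METHOD=AVERAGE is my choice for illustration only; substitute the method that you actually used.

```sas
/* Step 1: write the hierarchical tree to a data set by using OUTTREE= */
proc cluster data=Distance(type=distance) method=average
             outtree=Tree noprint;
   id ID;
run;

/* Step 2: cut the tree at the level that gives 3 clusters */
proc tree data=Tree ncl=3 out=Prune3 noprint;
run;

/* Count how many observations fall into each of the 3 clusters */
proc freq data=Prune3;
   tables cluster;
run;
```

The Prune3 data set has one row per observation and a CLUSTER variable, so you can merge it back with the original data by the ID values.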

 

I recommend that you read the Getting Started Example and the chapter on Introduction to Clustering Procedures. Good stuff in there.

mszommer
Obsidian | Level 7

Hello,

 

I'm faced with a similar situation: I have a Gower's similarity matrix that I need to use as an input for proc cluster. I followed your advice and allocated 12 GB of RAM to the SAS Process, however, it still wouldn't work. I get the following error:

 

WARNING: Unable to allocate sufficient memory. Amount requested was 2147483647, and the amount
         available was 2147483647. External storage will be used for distances.
ERROR: Invalid position -2147479016 for utility file WORK.'SASTMP-000000006'n.UTILITY.

 

Could you please suggest what I could do?

 

Regards

MS

koomalkc
Fluorite | Level 6

Hey Rick_SAS,

 

Do you mean that we can't feed the distance matrix (the output data set from PROC DISTANCE) into PROC FASTCLUS?

If so, how is it possible to use FASTCLUS for mixed data sets that contain nominal, interval, and binary variables?

 

Thanks!

Rick_SAS
SAS Super FREQ

Here is a link to the FASTCLUS documentation. The first sentence says "The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from one or more quantitative variables." Quantitative means interval, not nominal and not binary.

 

The procedure analyzes data by using the (L_p) distance between observations, so it is inherently assuming continuous coordinates.

 

As I mentioned previously, you can use PROC DISTANCE and then input the distance matrix to PROC CLUSTER or PROC MODECLUS. But PROC FASTCLUS only analyzes continuous variables.
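
Here is a minimal sketch of that approach for mixed data. The Customers data set and its variables (Income, Age, Region, Owner) are made up for illustration, and METHOD=AVERAGE is an arbitrary choice. The VAR statement in PROC DISTANCE lets you declare the measurement level of each group of variables, and METHOD=DGOWER (1 minus the Gower similarity) produces dissimilarities that PROC CLUSTER can read.

```sas
/* Hypothetical mixed data: two interval variables, one nominal, one binary */
data Customers;
   input ID $ Income Age Region Owner;
   datalines;
C01 42 35 1 0
C02 55 41 1 1
C03 38 29 2 0
C04 90 52 3 1
C05 87 49 3 1
C06 40 33 2 0
C07 61 45 1 1
C08 36 27 2 0
;

/* Gower dissimilarity: declare the level of each group of variables */
proc distance data=Customers out=Dist method=dgower;
   var interval(Income Age) nominal(Region Owner);
   id ID;
run;

/* Hierarchical clustering on the dissimilarity matrix */
proc cluster data=Dist method=average outtree=Tree noprint;
   id ID;
run;
```

After that, PROC TREE with NCL=3 cuts the tree into three segments, as described earlier in this thread.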

koomalkc
Fluorite | Level 6

I'm seeking more clarity on this issue.

I have a mixed data set that contains interval and categorical variables, and I want to run a FASTCLUS analysis to segment customers.

I am computing a distance matrix by using PROC DISTANCE with METHOD=GOWER and inputting the distance matrix to PROC CLUSTER.

By analyzing pseudo plots and other statistics, I found that a 3-cluster solution is the best fit for my objective.

Now, for segmentation purposes, I would like to input the distance matrix to PROC FASTCLUS with MAXC=3.

Can I do this? From your post I understand that this is not doable in FASTCLUS.

Do you have any idea how to work with a mixed data set in this situation?

Or do I need to stick with PROC CLUSTER for mixed data sets?

Thanks!
