Programming the statistical procedures from SAS

Applying fastclus on distance matrix obtained from distance procedure

Regular Contributor
Posts: 188

I am working on a clustering problem. I realized that, because the FASTCLUS procedure uses Euclidean distance, it might not be a good method for my data, which contain both continuous and binary variables. So I created a distance matrix based on the Gower distance by using "PROC DISTANCE" and tried to feed that as input to "PROC FASTCLUS". But it produced only 1 cluster, even though I specified 3 clusters.

However, PROC FASTCLUS works fine on the original data set that I fed to "PROC DISTANCE". How can I obtain the clusters from the distance matrix?

 

 

SAS Super FREQ
Posts: 3,630

Re: Applying fastclus on distance matrix obtained from distance procedure

You might be confusing PROC FASTCLUS with PROC CLUSTER. With PROC CLUSTER you can input a distance matrix (specify that it is a TYPE=DISTANCE data set). PROC CLUSTER can use the distances to create a dendrogram (tree) that clusters the observations. An example is given in the PROC CLUSTER documentation.

 

PROC FASTCLUS does not do hierarchical clustering. The goal of PROC FASTCLUS is to find new points that are the centers of the clusters. That requires knowing the coordinates of the observations themselves, rather than summarized data like a distance matrix.

 

When you input the distance matrix, PROC FASTCLUS thinks that you are giving it a new data set that has N observations and N variables. It interprets the distances as being coordinates.  It should report an ERROR if you input a TYPE=DISTANCE matrix. A Gower matrix has TYPE=SIMILAR, so PROC FASTCLUS will display a warning such as

WARNING: The DATA=WORK.DISTANCE.DATA data set is TYPE=SIMILAR, which is an unrecognized data set
type for the DATA= option, but it will be treated as an ordinary SAS data set.
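
One way to see what kind of matrix PROC DISTANCE wrote is to look at the "Data Set Type" field that PROC CONTENTS reports. The following is only a sketch: the data set names Sim and Dist are made up for this illustration, and it assumes that METHOD=DGOWER (documented as 1 minus the Gower similarity) is available in your SAS/STAT release.

```sas
/* Toy data: four 1-D points */
data A;
   input x @@;
   datalines;
0 1 5 6
;

/* Gower similarity: the output is expected to be TYPE=SIMILAR */
proc distance data=A out=Sim method=gower shape=square;
   var interval(x);
run;

/* 1 minus Gower similarity: the output is expected to be TYPE=DISTANCE */
proc distance data=A out=Dist method=dgower shape=square;
   var interval(x);
run;

/* Compare the "Data Set Type" field in the two reports */
proc contents data=Sim;
run;
proc contents data=Dist;
run;
```

If the type is SIMILAR, larger values mean "more alike," so the matrix should not be passed to a procedure that expects distances.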

 

Here is an example that demonstrates the difference between finding the two clusters for four 1-D points (the first call to PROC FASTCLUS) and finding two clusters for four 4-D points (the second call to PROC FASTCLUS):

 

data A;
input x @@;
datalines;
0 1 5 6
;

/* find optimal centers for 1-D data at c1=0.5 and c2=5.5 */
proc fastclus data=A maxclusters=2;
var x;
run;

proc distance data=A out=Distance method=gower shape=square;
var interval(x);
run;

/* Find optimal centers for 4-D data. A WARNING message is 
   displayed because the TYPE=SIMILAR data set is not a
   recognized input type; the rows are treated as coordinates. */
proc fastclus data=Distance maxclusters=2;
run;

Regular Contributor
Posts: 188

Re: Applying fastclus on distance matrix obtained from distance procedure

Thanks, but that does not meet my objective. How can I cluster the distance matrix output from PROC DISTANCE?

SAS Super FREQ
Posts: 3,630

Re: Applying fastclus on distance matrix obtained from distance procedure

As I mentioned in the first paragraph of my reply, you can use PROC CLUSTER to cluster the distance matrix.

Regular Contributor
Posts: 188

Re: Applying fastclus on distance matrix obtained from distance procedure

I tried PROC CLUSTER with METHOD=DENSITY K=30, but it failed to produce any result.

SAS Super FREQ
Posts: 3,630

Re: Applying fastclus on distance matrix obtained from distance procedure

Were there ERRORS or WARNINGS in the log?  Without more details or access to your data, there's not a lot we can suggest.  The following statements extract 50 obs from the sashelp.iris data set. Calling PROC DISTANCE creates a 50x50 distance matrix. Then PROC CLUSTER uses the distance matrix to form a hierarchical cluster model.  Perhaps you can modify this code to make yours work, or modify this code so that it "fails to produce any result."

 

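data Iris;
length ID $3;
set Sashelp.Iris;
ID = put(_N_, 3.);   /* explicit numeric-to-character conversion */
if _N_ <= 50;        /* keep the first 50 observations */
run;

/* METHOD=GOWER returns similarities; METHOD=DGOWER (1 minus Gower)
   returns distances, which is what PROC CLUSTER expects */
proc distance data=Iris out=Distance method=dgower shape=square;
var interval(SepalWidth SepalLength PetalWidth PetalLength);
id ID;
run;

proc cluster data=Distance(type=distance) method=density k=30;
id ID;
run;
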
Regular Contributor
Posts: 188

Re: Applying fastclus on distance matrix obtained from distance procedure

The log says that SAS failed to allocate memory.

I have a 27000x27000 distance matrix. Why should it fail? In the real world the matrix can be really huge, depending on the number of customers. For big industries, the number of customers can range from a few million to a billion.

 

Is there a way to get good results in reasonable time with PROC CLUSTER? Or is there some divisive method that overcomes this problem?

SAS Super FREQ
Posts: 3,630

Re: Applying fastclus on distance matrix obtained from distance procedure

You can compute how many gigabytes it takes to hold a matrix in memory. A 27000 x 27000 matrix of 8-byte doubles requires about 5.43 GB of RAM (27000 x 27000 x 8 bytes) just to store the distance matrix, plus additional memory to compute the clusters. The most likely explanation is that you are not allocating sufficient RAM to your SAS process. I don't know how much you would need, but try using the -MEMSIZE system option to request 16G (see the directions for setting the -MEMSIZE option).
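
As a rough check of that arithmetic, a DATA step along these lines reproduces the estimate and then displays the current MEMSIZE setting. It is only a sketch: it assumes 8 bytes per stored distance and ignores whatever working memory PROC CLUSTER needs on top of the matrix itself.

```sas
/* Estimate the memory needed to hold an n x n matrix of doubles */
data _null_;
   n     = 27000;
   bytes = n * n * 8;            /* 8 bytes per numeric value */
   gib   = bytes / 1024**3;      /* convert bytes to gibibytes */
   put "Full square matrix (GiB): " gib 6.2;
run;

/* Show how much memory the current SAS session is allowed to use */
proc options option=memsize;
run;
```

For n = 27000 the DATA step reports about 5.43, which matches the figure quoted above. If the MEMSIZE value is smaller than that, the allocation will fail.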

 

Whether your computation will finish in a reasonable time is another matter, and I have no experience with problems of that size. As its name suggests, the FASTCLUS procedure is much faster.   If you know that you are looking for a small number of clusters (maybe 50 or 100), it might be faster to run PROC FASTCLUS several times with varying MAXCLUSTERS= values.
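
One possible way to run PROC FASTCLUS for several values of MAXCLUSTERS= is a small macro loop. This is a sketch rather than a recommended workflow: the macro name fastclus_sweep and the output names Clus&k and Stat&k are invented for this illustration, and Sashelp.Iris stands in for your own interval-scaled data.

```sas
/* Hypothetical helper: run PROC FASTCLUS once per value in KLIST */
%macro fastclus_sweep(data=, vars=, klist=2 3 4 5);
   %local i k;
   %do i = 1 %to %sysfunc(countw(&klist));
      %let k = %scan(&klist, &i);
      /* OUT= holds the cluster assignments, OUTSTAT= the cluster statistics */
      proc fastclus data=&data maxclusters=&k maxiter=20
                    out=Clus&k outstat=Stat&k noprint;
         var &vars;
      run;
   %end;
%mend fastclus_sweep;

%fastclus_sweep(data=Sashelp.Iris,
                vars=SepalLength SepalWidth PetalLength PetalWidth)
```

You can then compare the OUTSTAT= data sets across the different values of k. Note that this operates on the original coordinates, not on a distance matrix.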

 

By the way, I just ran my example without the SHAPE=SQUARE option, and it works, so if your distance matrix is symmetric you might be able to save some memory by storing only half the matrix.

Regular Contributor
Posts: 188

Re: Applying fastclus on distance matrix obtained from distance procedure

Thanks for your inputs!!

Regular Contributor
Posts: 188

Re: Applying fastclus on distance matrix obtained from distance procedure

OK, I obtained the results, but I am having difficulty understanding the output. I think it formed as many clusters as there are rows in the matrix. How can I form only 3 clusters out of them? Does it make sense to rank _DENSE_ into 3 bins to get 3 clusters?

SAS Super FREQ
Posts: 3,630

Re: Applying fastclus on distance matrix obtained from distance procedure

You sound surprised, but a hierarchical tree partitions the observations into 1, 2, 3, 4, ..., 27,000 clusters. That is one reason that PROC CLUSTER is slower than FASTCLUS, which only partitions the obs into k clusters for a particular choice of k.  The distance or density variables enable you to prune the tree at a particular level, as shown in the CLUSTER documentation.  You should study the doc closely to get a handle on these concepts.

 

You can use PROC TREE to prune the tree to three clusters.  First use the OUTTREE=Tree option in the PROC CLUSTER statement to write the tree to a SAS data set.  Then use the following (untested) code to visualize:

 

proc tree data=Tree noprint out=Prune3 ncl=3; /* prune to 3 clusters */
/* optionally use HEIGHT stmt to specify the height var */
copy x1 x2 x3 ...; /* list important variables here */
run;

title "Plot of 3 Clusters";
proc sgscatter data=Prune3;
matrix x1 x2 x3 ... / group=cluster; /* impt vars */
run;
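
For reference, here is a self-contained sketch of the whole sequence (distance matrix, hierarchical clustering with OUTTREE=, pruning to three clusters) on the first 50 observations of Sashelp.Iris. The data set names Iris50, Dist, Tree, and Prune3 are arbitrary, and METHOD=EUCLID and METHOD=AVERAGE are chosen only to keep the illustration simple. They are not a recommendation for mixed data.

```sas
/* Take 50 observations and give each one a character ID */
data Iris50;
   set Sashelp.Iris(obs=50);
   length ID $3;
   ID = put(_N_, z3.);           /* 001, 002, ..., 050 */
run;

/* Euclidean distances between observations (a TYPE=DISTANCE data set) */
proc distance data=Iris50 out=Dist method=euclid shape=square;
   var interval(SepalWidth SepalLength PetalWidth PetalLength);
   id ID;
run;

/* Hierarchical clustering; OUTTREE= saves the tree for PROC TREE */
proc cluster data=Dist method=average outtree=Tree noprint;
   id ID;
run;

/* Cut the tree so that exactly three clusters remain */
proc tree data=Tree noprint out=Prune3 ncl=3;
run;

/* Count how many observations fall in each cluster */
proc freq data=Prune3;
   tables cluster;
run;
```

The OUT= data set from PROC TREE contains the CLUSTER variable, which you can merge back onto the original data by the observation name.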

 

I recommend that you read the Getting Started Example and the chapter on Introduction to Clustering Procedures. Good stuff in there.

Contributor
Posts: 22

Re: Applying fastclus on distance matrix obtained from distance procedure

Hello,

 

I'm faced with a similar situation: I have a Gower's similarity matrix that I need to use as input for PROC CLUSTER. I followed your advice and allocated 12 GB of RAM to the SAS process, but it still doesn't work. I get the following warning and error:

 

WARNING: Unable to allocate sufficient memory. Amount requested was 2147483647, and the amount
         available was 2147483647. External storage will be used for distances.
ERROR: Invalid position -2147479016 for utility file WORK.'SASTMP-000000006'n.UTILITY.

 

Could you please suggest what I could do?

 

Regards

MS

Contributor
Posts: 30

Re: Applying fastclus on distance matrix obtained from distance procedure

Hey Rick_SAS,

 

Do you mean that we can't feed the distance matrix (the output data set from PROC DISTANCE) into PROC FASTCLUS?

If so, how is it possible to use FASTCLUS for mixed data sets that contain nominal, interval, and binary variables?

 

Thanks!

SAS Super FREQ
Posts: 3,630

Re: Applying fastclus on distance matrix obtained from distance procedure

Here is a link to the FASTCLUS documentation. The first sentence says "The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from one or more quantitative variables." Quantitative means interval, not nominal and not binary.

 

The procedure analyzes data by using the L_p distance between observations, so it is inherently assuming continuous coordinates.

 

As I mentioned previously, you can use PROC DISTANCE and then input the distance matrix to PROC CLUSTER or PROC MODECLUS. But PROC FASTCLUS only analyzes continuous variables.

Contributor
Posts: 30

Re: Applying fastclus on distance matrix obtained from distance procedure

I'd like more clarity on this issue.

 

I have a mixed data set that contains interval and categorical variables, and I want to run a FASTCLUS analysis to segment customers.

I computed a distance matrix with PROC DISTANCE by using METHOD=GOWER and input that matrix to PROC CLUSTER.

By analyzing the pseudo-statistic plots and other statistics, I determined that a 3-cluster solution is the best fit for my objective.

 

Now, for segmentation purposes, I would like to input the distance matrix to PROC FASTCLUS with MAXC=3.

 

Can I do this? From your post I understand that this is not doable in FASTCLUS.

 

Do you have any idea how to work with a mixed data set in this situation?

Or do I need to stick with PROC CLUSTER for mixed data sets?

Thanks!
