06-27-2016 09:09 AM
I am working on a clustering problem. I realized that, since the FASTCLUS procedure uses Euclidean distance, it might not be a good method for my data, which contains both continuous and binary variables. So I created a distance matrix based on the Gower distance with "PROC DISTANCE" and tried to feed that as input to "PROC FASTCLUS". But it produced only 1 cluster even though I specified 3 clusters.
However, PROC FASTCLUS works fine on the original data set that was fed to "PROC DISTANCE". How can I obtain the clusters from the distance matrix?
06-27-2016 10:08 AM
You might be confusing PROC FASTCLUS with PROC CLUSTER. With PROC CLUSTER you can input a distance matrix (specify that it is a TYPE=DISTANCE data set). PROC CLUSTER can use the distances to create a dendrogram (tree) that clusters the observations. An example is given in the PROC CLUSTER documentation.
PROC FASTCLUS does not do hierarchical clustering. The goal of PROC FASTCLUS is to find new points that are centers of the clusters. That requires knowing the coordinates of the observations themselves, rather than summarized data like a distance matrix.
When you input the distance matrix, PROC FASTCLUS thinks that you are giving it a new data set that has N observations and N variables. It interprets the distances as being coordinates. It should report an ERROR if you input a TYPE=DISTANCE matrix. A Gower matrix has TYPE=SIMILAR, so PROC FASTCLUS displays a warning such as
WARNING: The DATA=WORK.DISTANCE.DATA data set is TYPE=SIMILAR, which is an unrecognized data set
type for the DATA= option, but it will be treated as an ordinary SAS data set.
Here is an example that demonstrates the difference between finding the two clusters for four 1-D points (the first call to PROC FASTCLUS) and finding two clusters for four 4-D points (the second call to PROC FASTCLUS):
data A;
input x @@;
datalines;
0 1 5 6
;

/* Find optimal centers for 1-D data: c1=0.5 and c2=5.5 */
proc fastclus data=A maxclusters=2;
   var x;
run;

proc distance data=A out=Distance method=gower shape=square;
   var interval(x);
run;

/* Find optimal centers for 4-D data. A WARNING message is displayed because
   FASTCLUS does not support a distance or similarity matrix as input. */
proc fastclus data=Distance maxclusters=2;
run;
06-27-2016 02:59 PM
Were there ERRORS or WARNINGS in the log? Without more details or access to your data, there's not a lot we can suggest. The following statements extract 50 obs from the Sashelp.Iris data set. Calling PROC DISTANCE creates a 50x50 distance matrix. Then PROC CLUSTER uses the distance matrix to form a hierarchical cluster model. Perhaps you can modify this code to make yours work, or modify this code so that it "fails to produce any result."
data Iris;
   length ID $3;
   set Sashelp.Iris;
   ID = put(_N_, 3.);   /* explicit numeric-to-character conversion */
   if _N_ <= 50;        /* keep the first 50 observations */
run;

proc distance data=Iris out=Distance method=gower shape=square;
   var interval(SepalWidth SepalLength PetalWidth PetalLength);
   id ID;
run;

proc cluster data=Distance(type=distance) method=density k=30;
   id ID;
run;
06-27-2016 03:17 PM
The log says that it failed to allocate memory.
I have a 27000x27000 distance matrix. Why should it fail? In the real world the matrix can be really huge, depending on the number of customers. For big industries the number of customers can range from a few million to a billion.
Is there a way to get good results in reasonable time with PROC CLUSTER? Or is there some divisive method that overcomes this problem?
06-27-2016 04:06 PM
You can compute how many gigabytes it takes to hold a matrix in memory. A 27000 x 27000 matrix of 8-byte numeric values requires about 5.43 GB of RAM just to store the distance matrix, and additional memory to compute the clusters. The most likely explanation is that you are not allocating sufficient RAM to your SAS process. I don't know how much you would need, but try using the -MEMSIZE system option to allocate 16G: Directions for setting the -MEMSIZE option.
Whether your computation will finish in a reasonable time is another matter, and I have no experience with problems of that size. Obviously (from the name) the FASTCLUS procedure is much faster. If you know that you are looking for a small number of clusters (maybe 50 or 100), it might be faster to run PROC FASTCLUS several times with varying MAXCLUSTERS= values.
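The arithmetic behind that estimate can be checked with a short DATA step. This is only a sketch: it assumes each distance is stored as an 8-byte numeric value, and the variable names (n, gib_full, gib_half) are made up for illustration. It says nothing about the extra working memory that PROC CLUSTER needs.

```sas
/* Sketch: estimate the memory needed to hold an n x n matrix of 8-byte values */
data _null_;
   n = 27000;                                   /* number of observations */
   gib_full = n * n * 8 / 1024**3;              /* full square matrix, in GiB */
   gib_half = n * (n + 1) / 2 * 8 / 1024**3;    /* lower triangle incl. diagonal */
   put "Full square matrix: " gib_full 8.2 " GiB";
   put "Lower triangle:     " gib_half 8.2 " GiB";
run;
```

For n = 27000 the full matrix comes to roughly 5.43 GiB and the lower triangle to about half of that.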
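Running PROC FASTCLUS for several values of k can be wrapped in a small macro. The following is a minimal sketch, not a tuned workflow: the macro name TryK, its parameters, and the output data set names Clus&k and Stat&k are invented for illustration, and Sashelp.Iris stands in for real data. It applies only to continuous variables.

```sas
/* Sketch: run PROC FASTCLUS once for each k in a space-separated list */
%macro TryK(data=, vars=, kList=2 3 4 5);
   %local i k;
   %let i = 1;
   %let k = %scan(&kList, &i);
   %do %while(%length(&k) > 0);
      /* OUT= holds the cluster assignments; OUTSTAT= holds summary statistics */
      proc fastclus data=&data maxclusters=&k maxiter=50
                    out=Clus&k outstat=Stat&k noprint;
         var &vars;
      run;
      %let i = %eval(&i + 1);
      %let k = %scan(&kList, &i);
   %end;
%mend TryK;

/* Example call on continuous variables only */
%TryK(data=Sashelp.Iris, vars=SepalLength SepalWidth PetalLength PetalWidth);
```

The Stat&k data sets can then be compared across k to help choose the number of clusters.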
By the way, I just ran my example without the SHAPE=SQUARE option, and it works, so if your distance matrix is symmetric you might be able to save some memory by storing only half the matrix.
06-28-2016 08:28 AM
OK, I obtained the results, but I am having difficulty understanding the output. I think it has formed as many clusters as there are rows in the matrix. How can I form only 3 clusters out of them? Does it make sense to rank _DENSE_ into 3 bins to get 3 clusters?
06-28-2016 09:30 AM
You sound surprised, but a hierarchical tree partitions the observations into 1, 2, 3, 4, ..., 27,000 clusters, which is one reason that PROC CLUSTER is slower than FASTCLUS, which only partitions the obs into k clusters for a particular choice of k. The distance or density variables enable you to prune the tree at a particular level, as shown in the CLUSTER documentation. You should study the doc closely to get a handle on these concepts.
You can use PROC TREE to prune the tree to three clusters. First use the OUTTREE=Tree option in the PROC CLUSTER statement to write the tree to a SAS data set. Then use the following (untested) code to visualize:
proc tree data=tree noprint out=Prune3 ncl=3;   /* prune to 3 clusters */
   /* optionally use HEIGHT stmt to specify the height var */
   copy x1 x2 x3 ...;   /* list important variables here */
run;

title "Plot of 3 Clusters";
proc sgscatter data=Prune3;
   matrix x1 x2 x3 ... / group=cluster;   /* impt vars */
run;
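When the tree is built from a distance matrix, the original variables are not in the tree data set unless they were copied along, so one option is to join the cluster assignments back to the raw data by ID. The sketch below shows the whole chain on Sashelp.Iris under several assumptions that are not from the thread: zero-padded ID values, METHOD=DGOWER (documented as 1 minus Gower, so it yields a dissimilarity), METHOD=AVERAGE linkage chosen only for illustration, and made-up data set names (Dist, Tree, Prune3, IrisClus).

```sas
/* Sketch: distance matrix -> hierarchical tree -> 3 clusters -> join to raw data */
data Iris;
   set Sashelp.Iris;
   length ID $3;
   ID = put(_N_, z3.);               /* '001', '002', ... (no leading blanks) */
run;

proc distance data=Iris out=Dist method=dgower;
   var interval(SepalWidth SepalLength PetalWidth PetalLength);
   id ID;
run;

proc cluster data=Dist method=average outtree=Tree noprint;
   id ID;                            /* leaf names in the tree are the ID values */
run;

proc tree data=Tree noprint out=Prune3 nclusters=3;
run;                                 /* Prune3 contains _NAME_ and CLUSTER */

proc sql;                            /* attach cluster numbers to the raw data */
   create table IrisClus as
   select a.*, b.Cluster
   from Iris as a inner join Prune3 as b
   on a.ID = strip(b._NAME_);
quit;

title "Plot of 3 Clusters";
proc sgplot data=IrisClus;
   scatter x=PetalLength y=PetalWidth / group=Cluster;
run;
```

No ID statement is used in PROC TREE here, so the leaf names arrive in the _NAME_ variable, which is why the join matches on _NAME_.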
03-08-2017 08:52 AM
I'm faced with a similar situation: I have a Gower's similarity matrix that I need to use as an input for PROC CLUSTER. I followed your advice and allocated 12 GB of RAM to the SAS process, but it still doesn't work. I get the following error:
WARNING: Unable to allocate sufficient memory. Amount requested was 2147483647, and the amount
available was 2147483647. External storage will be used for distances.
ERROR: Invalid position -2147479016 for utility file WORK.'SASTMP-000000006'n.UTILITY.
Could you please suggest what I could do?
08-16-2017 02:29 PM - edited 08-16-2017 02:30 PM
Do you mean that we cannot feed the distance matrix (the output data set from PROC DISTANCE) into PROC FASTCLUS?
If so, how is it possible to use FASTCLUS for mixed data sets that contain nominal, interval, and binary variables?
08-16-2017 03:30 PM
Here is a link to the FASTCLUS documentation. The first sentence says "The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from one or more quantitative variables." Quantitative means interval, not nominal and not binary.
The procedure analyzes data by using the L_p distance between observations, so it is inherently assuming continuous coordinates.
As I mentioned previously, you can use PROC DISTANCE and then input the distance matrix to PROC CLUSTER or PROC MODECLUS. But PROC FASTCLUS only analyzes continuous variables.
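For mixed data, the DISTANCE-then-CLUSTER route might look like the sketch below. It is an illustration under stated assumptions rather than a recommended analysis: Sashelp.Class stands in for a customer table (Sex is nominal; Age, Height, and Weight are interval), METHOD=DGOWER is used so that the output is a dissimilarity, METHOD=AVERAGE is an arbitrary linkage choice, and the data set names MixedDist, MixedTree, and Segments are made up.

```sas
/* Sketch: Gower dissimilarity on mixed variables, then hierarchical clustering */
proc distance data=Sashelp.Class out=MixedDist method=dgower;
   var interval(Age Height Weight)   /* continuous variables */
       nominal(Sex);                 /* categorical variable */
   id Name;
run;

proc cluster data=MixedDist method=average outtree=MixedTree noprint;
   id Name;
run;

/* Cut the tree at 3 clusters; Segments holds _NAME_ and CLUSTER */
proc tree data=MixedTree noprint out=Segments nclusters=3;
run;
```

The measurement level given in the VAR statement tells PROC DISTANCE how each variable contributes to the Gower coefficient, which is what lets nominal and interval variables be combined in one matrix.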
08-16-2017 04:09 PM - edited 08-17-2017 12:46 PM
I'm seeking more clarity on this issue.
I have a mixed data set that contains interval and categorical variables, and I want to run a FASTCLUS analysis to segment customers.
I'm computing a distance matrix with PROC DISTANCE and METHOD=GOWER, and I input that distance matrix to PROC CLUSTER.
By analyzing the pseudo-statistic plots and other statistics, I figured out that a 3-cluster solution is the best fit for my objective.
Now, for segmentation purposes, I would like to input the distance matrix to PROC FASTCLUS with MAXC=3.
Can I do this? From your post I understand that this is not doable in FASTCLUS.
Do you have any idea how to work with a mixed data set in this situation?
Or do I need to stick with PROC CLUSTER for mixed data sets?