- Applying fastclus on distance matrix obtained from...

06-27-2016 09:09 AM

I am working on a clustering problem. I realized that since the FASTCLUS procedure uses Euclidean distance, it might not be a very good method for my data, which contains both continuous and binary variables. So I created a distance matrix based on Gower distance using "PROC DISTANCE" and tried to feed that as input to "PROC FASTCLUS". But it produced only 1 cluster when I specified 3 clusters.

However, PROC FASTCLUS works fine on the original dataset that was fed to "PROC DISTANCE". How can I obtain the clusters from the distance matrix?

06-27-2016 10:08 AM

You might be confusing PROC FASTCLUS with PROC CLUSTER. With PROC CLUSTER you can input a distance matrix (specify that it is a TYPE=DISTANCE data set). PROC CLUSTER can use the distances to create a dendrogram (tree) that clusters the observations. An example is given in the PROC CLUSTER documentation.

PROC FASTCLUS does not do hierarchical clustering. The goal of PROC FASTCLUS is to find new points that are centers of the clusters. That requires knowing the coordinates of the observations themselves, rather than summarized data like a distance matrix.

When you input the distance matrix, PROC FASTCLUS thinks that you are giving it a new data set that has N observations and N variables. It interprets the distances as being coordinates. It should report an ERROR if you input a TYPE=DISTANCE matrix. A Gower matrix has TYPE=SIMILAR, so PROC FASTCLUS displays a warning such as

WARNING: The DATA=WORK.DISTANCE.DATA data set is TYPE=SIMILAR, which is an unrecognized data set type for the DATA= option, but it will be treated as an ordinary SAS data set.

Here is an example that demonstrates the difference between finding the two clusters for four 1-D points (the first call to PROC FASTCLUS) and finding two clusters for four 4-D points (the second call to PROC FASTCLUS):

```
/* Four 1-D points that form two obvious groups */
data A;
input x @@;
datalines;
0 1 5 6
;
/* Find optimal centers for 1-D data at c1=0.5 and c2=5.5 */
proc fastclus data=A maxclusters=2;
var x;
run;
/* METHOD=GOWER writes a 4x4 matrix that has TYPE=SIMILAR */
proc distance data=A out=Distance method=gower shape=square;
var interval(x);
run;
/* Find optimal centers for 4-D data: each row of the matrix is
treated as a 4-D point. A WARNING message is displayed because
the TYPE=SIMILAR data set is not a recognized input type. */
proc fastclus data=Distance maxclusters=2;
run;
```

06-27-2016 01:02 PM

Thanks, but that does not meet my objective. How can I cluster the distance matrix output from PROC DISTANCE?

06-27-2016 01:04 PM

As I mentioned in the first paragraph of my reply, you can use PROC CLUSTER to cluster the distance matrix.

06-27-2016 02:44 PM

I tried PROC CLUSTER with method=density k=30, but it failed to produce any result.

06-27-2016 02:59 PM

Were there ERRORS or WARNINGS in the log? Without more details or access to your data, there's not a lot we can suggest. The following statements extract 50 obs from the Sashelp.Iris data set. Calling PROC DISTANCE creates a 50x50 distance matrix. Then PROC CLUSTER uses the distance matrix to form a hierarchical cluster model. Perhaps you can modify this code to make yours work, or modify this code so that it "fails to produce any result."

```
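/* Keep the first 50 observations and give each one an ID */
data Iris;
length ID $3.;
set Sashelp.Iris;
ID = _N_;
if _N_ <= 50;
run;
/* METHOD=DGOWER writes Gower dissimilarities (1 - similarity);
METHOD=GOWER would write similarities (TYPE=SIMILAR) instead */
proc distance data=Iris out=Distance method=dgower shape=square;
var interval(SepalWidth SepalLength PetalWidth PetalLength);
id ID;
run;
/* Hierarchical clustering from the 50x50 distance matrix */
proc cluster data=Distance(type=distance) method=density k=30;
id ID;
run;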
```

06-27-2016 03:17 PM

It says it failed to allocate memory.

I have a 27000x27000 distance matrix. Why should it fail? In the real world the matrix can be really huge, depending on the number of customers. For big industries, the number of customers can range from a few million to a billion.

Is there a way to get good results in reasonable time with proc cluster? Or some divisive method to overcome this problem?

06-27-2016 04:06 PM

You can compute how many gigabytes it takes to hold a matrix in memory. A 27000 x 27000 matrix of 8-byte values requires 5.43 GB of RAM just to store the distances, and additional memory is needed to compute the clusters. The most likely explanation is that you are not allocating sufficient RAM to your SAS process. I don't know how much you would need, but try using the -MEMSIZE system option to allocate 16G: Directions for setting the -MEMSIZE option.
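The arithmetic can be sketched in a DATA step, assuming each distance is stored as an 8-byte numeric value. The variable names are only illustrative, and the PROC OPTIONS step simply prints the current MEMSIZE setting so you can compare the two numbers.

```sas
data _null_;
   n     = 27000;              /* rows (and columns) of the distance matrix */
   bytes = n * n * 8;          /* assumes 8 bytes per stored distance */
   gib   = bytes / 1024**3;    /* convert bytes to gibibytes */
   put "Full matrix (GiB): " gib 8.2;
run;

/* Show how much memory the current SAS session is allowed to use */
proc options option=memsize;
run;
```

MEMSIZE has to be set when SAS starts (on the command line or in a configuration file), so it cannot be changed from inside a running session.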

Whether your computation will finish in a reasonable time is another matter, and I have no experience with problems of that size. Obviously (from the name) the FASTCLUS procedure is much faster. If you know that you are looking for a small number of clusters (maybe 50 or 100), it might be faster to run PROC FASTCLUS several times with varying MAXCLUSTERS= values.
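A minimal sketch of that idea, using Sashelp.Iris as stand-in data with only continuous variables. The macro name, the range of k, and the output data set names (Clus2, Stat2, ...) are placeholders I made up for illustration.

```sas
%macro fastclus_sweep(kmin=2, kmax=6);
   %do k = &kmin %to &kmax;
      /* One k-means run per candidate number of clusters */
      proc fastclus data=Sashelp.Iris maxclusters=&k maxiter=100
                    out=Clus&k outstat=Stat&k noprint;
         var SepalLength SepalWidth PetalLength PetalWidth;
      run;
   %end;
%mend fastclus_sweep;

%fastclus_sweep(kmin=2, kmax=6)
```

Each OUT= data set holds the cluster assignment for that k, and the OUTSTAT= data sets hold summary statistics that you can compare across runs to choose a value of k.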

By the way, I just ran my example without the SHAPE=SQUARE option, and it works, so if your distance matrix is symmetric you might be able to save some memory by storing only half the matrix.
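For reference, here is a sketch of the same Iris example with the SHAPE= option left out, so PROC DISTANCE writes its default lower-triangular matrix. It uses METHOD=DGOWER so that the output is a dissimilarity matrix, and the data set name DistTri is just a placeholder. As far as I know the upper triangle is written as missing values, so how much space this saves in practice may depend on how the data set is stored.

```sas
data Iris;
   length ID $3;
   set Sashelp.Iris;
   ID = cats(_N_);             /* character ID: "1", "2", ... */
   if _N_ <= 50;
run;

/* No SHAPE= option: only the lower triangle is filled in */
proc distance data=Iris out=DistTri method=dgower;
   var interval(SepalWidth SepalLength PetalWidth PetalLength);
   id ID;
run;

proc cluster data=DistTri method=density k=30;
   id ID;
run;
```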

06-28-2016 05:24 AM

Thanks for your inputs!!

06-28-2016 08:28 AM

OK, I obtained the results, but I am having difficulty understanding the output. I think it has formed as many clusters as there are rows in the matrix. How can I form only 3 clusters out of them? Does it make sense to rank the _DENSE_ values into 3 bins to get 3 clusters?

06-28-2016 09:30 AM

You sound surprised, but a hierarchical tree partitions the observations into 1, 2, 3, 4, ..., 27,000 clusters, which is one reason that PROC CLUSTER is slower than FASTCLUS, which only partitions the obs into k clusters for a particular choice of k. The distance or density variables enable you to prune the tree at a particular level, as shown in the CLUSTER documentation. You should study the doc closely to get a handle on these concepts.

You can use PROC TREE to prune the tree to three clusters. First use the OUTTREE=Tree option in the PROC CLUSTER statement to write the tree to a SAS data set. Then use the following (untested) code to visualize:

```
proc tree data=tree noprint out=Prune3 ncl=3; /* prune to 3 clusters */
/* optionally use HEIGHT stmt to specify the height var */
copy x1 x2 x3 ...; /* list important variables here */
run;
title "Plot of 3 Clusters";
proc sgscatter data=Prune3;
matrix x1 x2 x3 ... / group=cluster; /* impt vars */
run;
```

I recommend that you read the Getting Started Example and the chapter on Introduction to Clustering Procedures. Good stuff in there.

03-08-2017 08:52 AM

Hello,

I'm faced with a similar situation: I have a Gower's similarity matrix that I need to use as input for PROC CLUSTER. I followed your advice and allocated 12 GB of RAM to the SAS process, but it still doesn't work. I get the following messages:

WARNING: Unable to allocate sufficient memory. Amount requested was 2147483647, and the amount available was 2147483647. External storage will be used for distances.

ERROR: Invalid position -2147479016 for utility file WORK.'SASTMP-000000006'n.UTILITY.

Could you please suggest what I could do?

Regards

MS

08-16-2017 02:29 PM - edited 08-16-2017 02:30 PM

Hey Rick_SAS,

Do you mean that we can't feed the distance matrix (the output data set from PROC DISTANCE) into PROC FASTCLUS?

If so, how is it possible to use FASTCLUS for mixed datasets containing nominal, interval, and binary variables?

Thanks!

08-16-2017 03:30 PM

Here is a link to the FASTCLUS documentation. The first sentence says "The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from one or more **quantitative** variables." Quantitative means interval, not nominal and not binary.

The procedure analyzes data by using the L_p distance between observations, so it is inherently assuming continuous coordinates.

As I mentioned previously, you can use PROC DISTANCE and then input the distance matrix to PROC CLUSTER or PROC MODECLUS. But PROC FASTCLUS only analyzes continuous variables.
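A rough sketch of that route for mixed data, under several assumptions: the Customers data set and its variables are invented for illustration, the categorical variables are coded as numbers, and METHOD=DGOWER is used so that the matrix holds dissimilarities rather than similarities. Average linkage is just one choice of method that accepts distance data.

```sas
/* Hypothetical mixed data: two interval and two categorical variables */
data Customers;
   input ID $ Income Age Region Owner;
   datalines;
C1 42000 25 1 0
C2 45000 31 1 0
C3 98000 52 2 1
C4 91000 47 2 1
C5 30000 22 3 0
C6 33000 24 3 1
C7 87000 55 2 1
C8 41000 29 1 0
;

/* Gower dissimilarity handles interval and nominal variables together */
proc distance data=Customers out=GowerDist method=dgower;
   var interval(Income Age) nominal(Region Owner);
   id ID;
run;

/* Hierarchical clustering on the distance matrix */
proc cluster data=GowerDist method=average outtree=Tree noprint;
   id ID;
run;

/* Cut the tree at 3 clusters; Segments has one row per customer */
proc tree data=Tree noprint out=Segments ncl=3;
run;
```

The Segments data set should contain a CLUSTER variable along with a name variable that identifies each customer, which you can merge back onto the original data for profiling.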

08-16-2017 04:09 PM - edited 08-17-2017 12:46 PM

I'm seeking more clarity on this issue.

I have a mixed dataset that contains interval and categorical variables, and I want to do a FASTCLUS analysis for segmenting customers.

I'm computing a distance matrix with PROC DISTANCE using METHOD=GOWER and inputting the distance matrix into PROC CLUSTER.

By analyzing pseudo plots and other statistics, I figured out that a 3-cluster solution is the best fit for my objective.

Now, for segmentation purposes, I would like to input the distance matrix into PROC FASTCLUS with MAXC=3.

Can I do this? From your post I understand that this is not doable in FASTCLUS.

Do you have any idea how to work with a mixed dataset in this situation?

Or do I need to stick with PROC CLUSTER for a mixed dataset?

Thanks!