Programming the statistical procedures from SAS

Clustering in SAS

Occasional Contributor
Posts: 5

Clustering in SAS

Good afternoon, 

 

This is regarding clustering procedures in SAS, specifically PROC FASTCLUS. I'm clustering a portfolio of 1 million customers as of Jan 2016 based on profit, volume of transactions, and size of transactions. I found 6 clusters, numbered 1 to 6 in the output of my FASTCLUS procedure.

 

If I rerun my clustering on Jan 2015 data, will each of the clusters retain the same underlying properties? For example, Cluster 1 in 2016 represents a high-profit customer with high transaction volume and size. Will Cluster 1 in 2015 reflect the same (high profit, volume, and size), or will Cluster 1 be completely different?

 

For additional context, most of the customers remain the same between 2015 and 2016, and the three variables (profit, volume of transactions, and size of transactions) are constructed the same way in both periods.

 

thank you

Valued Guide
Posts: 835

Re: Clustering in SAS

[ Edited ]

I think you are saying that you expect the same 6 cluster attributes from the next dataset - presumably because the dropped and added customers are not from different underlying populations.  And you are asking whether the cluster id assignment would be the same.  Is that correct?

 

If so, the answer is no.  I ran the FASTCLUS example from the SAS documentation against the Fisher iris data as found on Wikipedia.  Then I sorted the data and reran FASTCLUS: I got the same clusters but with different IDs.  In other words, data order matters, because FASTCLUS chooses its initial seeds as it reads the observations in order.

 

You might get lucky and have the same cluster ID assignments, but there is no guarantee.

 

data iris;
  /* $13. rather than $12. so that "I. versicolor" (13 characters) is not truncated */
  input sepallength sepalwidth petallength petalwidth species $13.;
datalines;
5.1 3.5 1.4 0.2 I. setosa
4.9 3.0 1.4 0.2 I. setosa
4.7 3.2 1.3 0.2 I. setosa
4.6 3.1 1.5 0.2 I. setosa
5.0 3.6 1.4 0.3 I. setosa
5.4 3.9 1.7 0.4 I. setosa
4.6 3.4 1.4 0.3 I. setosa
5.0 3.4 1.5 0.2 I. setosa
4.4 2.9 1.4 0.2 I. setosa
4.9 3.1 1.5 0.1 I. setosa
5.4 3.7 1.5 0.2 I. setosa
4.8 3.4 1.6 0.2 I. setosa
4.8 3.0 1.4 0.1 I. setosa
4.3 3.0 1.1 0.1 I. setosa
5.8 4.0 1.2 0.2 I. setosa
5.7 4.4 1.5 0.4 I. setosa
5.4 3.9 1.3 0.4 I. setosa
5.1 3.5 1.4 0.3 I. setosa
5.7 3.8 1.7 0.3 I. setosa
5.1 3.8 1.5 0.3 I. setosa
5.4 3.4 1.7 0.2 I. setosa
5.1 3.7 1.5 0.4 I. setosa
4.6 3.6 1.0 0.2 I. setosa
5.1 3.3 1.7 0.5 I. setosa
4.8 3.4 1.9 0.2 I. setosa
5.0 3.0 1.6 0.2 I. setosa
5.0 3.4 1.6 0.4 I. setosa
5.2 3.5 1.5 0.2 I. setosa
5.2 3.4 1.4 0.2 I. setosa
4.7 3.2 1.6 0.2 I. setosa
4.8 3.1 1.6 0.2 I. setosa
5.4 3.4 1.5 0.4 I. setosa
5.2 4.1 1.5 0.1 I. setosa
5.5 4.2 1.4 0.2 I. setosa
4.9 3.1 1.5 0.2 I. setosa
5.0 3.2 1.2 0.2 I. setosa
5.5 3.5 1.3 0.2 I. setosa
4.9 3.6 1.4 0.1 I. setosa
4.4 3.0 1.3 0.2 I. setosa
5.1 3.4 1.5 0.2 I. setosa
5.0 3.5 1.3 0.3 I. setosa
4.5 2.3 1.3 0.3 I. setosa
4.4 3.2 1.3 0.2 I. setosa
5.0 3.5 1.6 0.6 I. setosa
5.1 3.8 1.9 0.4 I. setosa
4.8 3.0 1.4 0.3 I. setosa
5.1 3.8 1.6 0.2 I. setosa
4.6 3.2 1.4 0.2 I. setosa
5.3 3.7 1.5 0.2 I. setosa
5.0 3.3 1.4 0.2 I. setosa
7.0 3.2 4.7 1.4 I. versicolor
6.4 3.2 4.5 1.5 I. versicolor
6.9 3.1 4.9 1.5 I. versicolor
5.5 2.3 4.0 1.3 I. versicolor
6.5 2.8 4.6 1.5 I. versicolor
5.7 2.8 4.5 1.3 I. versicolor
6.3 3.3 4.7 1.6 I. versicolor
4.9 2.4 3.3 1.0 I. versicolor
6.6 2.9 4.6 1.3 I. versicolor
5.2 2.7 3.9 1.4 I. versicolor
5.0 2.0 3.5 1.0 I. versicolor
5.9 3.0 4.2 1.5 I. versicolor
6.0 2.2 4.0 1.0 I. versicolor
6.1 2.9 4.7 1.4 I. versicolor
5.6 2.9 3.6 1.3 I. versicolor
6.7 3.1 4.4 1.4 I. versicolor
5.6 3.0 4.5 1.5 I. versicolor
5.8 2.7 4.1 1.0 I. versicolor
6.2 2.2 4.5 1.5 I. versicolor
5.6 2.5 3.9 1.1 I. versicolor
5.9 3.2 4.8 1.8 I. versicolor
6.1 2.8 4.0 1.3 I. versicolor
6.3 2.5 4.9 1.5 I. versicolor
6.1 2.8 4.7 1.2 I. versicolor
6.4 2.9 4.3 1.3 I. versicolor
6.6 3.0 4.4 1.4 I. versicolor
6.8 2.8 4.8 1.4 I. versicolor
6.7 3.0 5.0 1.7 I. versicolor
6.0 2.9 4.5 1.5 I. versicolor
5.7 2.6 3.5 1.0 I. versicolor
5.5 2.4 3.8 1.1 I. versicolor
5.5 2.4 3.7 1.0 I. versicolor
5.8 2.7 3.9 1.2 I. versicolor
6.0 2.7 5.1 1.6 I. versicolor
5.4 3.0 4.5 1.5 I. versicolor
6.0 3.4 4.5 1.6 I. versicolor
6.7 3.1 4.7 1.5 I. versicolor
6.3 2.3 4.4 1.3 I. versicolor
5.6 3.0 4.1 1.3 I. versicolor
5.5 2.5 4.0 1.3 I. versicolor
5.5 2.6 4.4 1.2 I. versicolor
6.1 3.0 4.6 1.4 I. versicolor
5.8 2.6 4.0 1.2 I. versicolor
5.0 2.3 3.3 1.0 I. versicolor
5.6 2.7 4.2 1.3 I. versicolor
5.7 3.0 4.2 1.2 I. versicolor
5.7 2.9 4.2 1.3 I. versicolor
6.2 2.9 4.3 1.3 I. versicolor
5.1 2.5 3.0 1.1 I. versicolor
5.7 2.8 4.1 1.3 I. versicolor
6.3 3.3 6.0 2.5 I. virginica
5.8 2.7 5.1 1.9 I. virginica
7.1 3.0 5.9 2.1 I. virginica
6.3 2.9 5.6 1.8 I. virginica
6.5 3.0 5.8 2.2 I. virginica
7.6 3.0 6.6 2.1 I. virginica
4.9 2.5 4.5 1.7 I. virginica
7.3 2.9 6.3 1.8 I. virginica
6.7 2.5 5.8 1.8 I. virginica
7.2 3.6 6.1 2.5 I. virginica
6.5 3.2 5.1 2.0 I. virginica
6.4 2.7 5.3 1.9 I. virginica
6.8 3.0 5.5 2.1 I. virginica
5.7 2.5 5.0 2.0 I. virginica
5.8 2.8 5.1 2.4 I. virginica
6.4 3.2 5.3 2.3 I. virginica
6.5 3.0 5.5 1.8 I. virginica
7.7 3.8 6.7 2.2 I. virginica
7.7 2.6 6.9 2.3 I. virginica
6.0 2.2 5.0 1.5 I. virginica
6.9 3.2 5.7 2.3 I. virginica
5.6 2.8 4.9 2.0 I. virginica
7.7 2.8 6.7 2.0 I. virginica
6.3 2.7 4.9 1.8 I. virginica
6.7 3.3 5.7 2.1 I. virginica
7.2 3.2 6.0 1.8 I. virginica
6.2 2.8 4.8 1.8 I. virginica
6.1 3.0 4.9 1.8 I. virginica
6.4 2.8 5.6 2.1 I. virginica
7.2 3.0 5.8 1.6 I. virginica
7.4 2.8 6.1 1.9 I. virginica
7.9 3.8 6.4 2.0 I. virginica
6.4 2.8 5.6 2.2 I. virginica
6.3 2.8 5.1 1.5 I. virginica
6.1 2.6 5.6 1.4 I. virginica
7.7 3.0 6.1 2.3 I. virginica
6.3 3.4 5.6 2.4 I. virginica
6.4 3.1 5.5 1.8 I. virginica
6.0 3.0 4.8 1.8 I. virginica
6.9 3.1 5.4 2.1 I. virginica
6.7 3.1 5.6 2.4 I. virginica
6.9 3.1 5.1 2.3 I. virginica
5.8 2.7 5.1 1.9 I. virginica
6.8 3.2 5.9 2.3 I. virginica
6.7 3.3 5.7 2.5 I. virginica
6.7 3.0 5.2 2.3 I. virginica
6.3 2.5 5.0 1.9 I. virginica
6.5 3.0 5.2 2.0 I. virginica
6.2 3.4 5.4 2.3 I. virginica
5.9 3.0 5.1 1.8 I. virginica
run;
proc fastclus data=iris maxc=2 maxiter=10 out=clus;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

proc sort data=iris;
  by descending sepallength sepalwidth;
run;
proc fastclus data=iris maxc=2 maxiter=10 out=clus2;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
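
One way to confirm that only the labels changed is to profile the clusters from each run side by side. The sketch below is an illustrative check that is not part of the original post; it assumes the CLUS and CLUS2 data sets created above and uses PROC MEANS with a CLASS statement.

```sas
/* Profile the clusters from the first run (original data order) */
title 'Cluster profiles: original order';
proc means data=clus n mean maxdec=2;
  class cluster;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Profile the clusters from the second run (sorted data) */
title 'Cluster profiles: sorted order';
proc means data=clus2 n mean maxdec=2;
  class cluster;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Clear the title so it does not carry over to later output */
title;
```

If the same pair of mean profiles appears in both outputs but under swapped CLUSTER values, the clusters are the same and only the numbering differs.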

 

Occasional Contributor
Posts: 5

Re: Clustering in SAS

Thanks, Mark. I appreciate the response. That's exactly what I meant. Do you know of a way to retain the same properties? Is there a segmentation logic which I can leverage?

 

 

Valued Guide
Posts: 835

Re: Clustering in SAS

[ Edited ]

Why not make 6 synthetic customers from the first FASTCLUS run, each with its clustering variables at the centroid of the corresponding cluster?  You can use the MEAN= option to output the cluster centers to a data set, as I've done with MY_CLUSTERCENTERS below.

 

Then include those 6 in the input to the second FASTCLUS run (via PROC APPEND).  Find them in the FASTCLUS results (they'll be the last 6 observations in the data set), and then you'll know how to recode your cluster IDs (the IDs of the appended observations are the original cluster IDs).

 

Edited addition: this all assumes sufficient stability in your changing data sets that you are confident the nature of the 6 clusters does not change over time.

 

In the case of the IRIS data, with 2 clusters it would be this:

 

/* Make the first data set - if it doesn't have an ID var, make one */
data iris;
   title 'Fisher (1936) Iris Data';
   /* $13. so that "I. versicolor" is not truncated; the values below are in cm, not mm */
   input SepalLength SepalWidth PetalLength PetalWidth Species $13.;
   label SepalLength='Sepal Length in cm.'
         SepalWidth ='Sepal Width in cm.'
         PetalLength='Petal Length in cm.'
         PetalWidth ='Petal Width in cm.';
   id=1000000+_n_;
datalines;
5.1 3.5 1.4 0.2 I. setosa 
4.9 3.0 1.4 0.2 I. setosa 
4.7 3.2 1.3 0.2 I. setosa 
4.6 3.1 1.5 0.2 I. setosa 
5.0 3.6 1.4 0.3 I. setosa 
5.4 3.9 1.7 0.4 I. setosa 
4.6 3.4 1.4 0.3 I. setosa 
5.0 3.4 1.5 0.2 I. setosa 
4.4 2.9 1.4 0.2 I. setosa 
4.9 3.1 1.5 0.1 I. setosa 
5.4 3.7 1.5 0.2 I. setosa 
4.8 3.4 1.6 0.2 I. setosa 
4.8 3.0 1.4 0.1 I. setosa 
4.3 3.0 1.1 0.1 I. setosa 
5.8 4.0 1.2 0.2 I. setosa 
5.7 4.4 1.5 0.4 I. setosa 
5.4 3.9 1.3 0.4 I. setosa 
5.1 3.5 1.4 0.3 I. setosa 
5.7 3.8 1.7 0.3 I. setosa 
5.1 3.8 1.5 0.3 I. setosa 
5.4 3.4 1.7 0.2 I. setosa 
5.1 3.7 1.5 0.4 I. setosa 
4.6 3.6 1.0 0.2 I. setosa 
5.1 3.3 1.7 0.5 I. setosa 
4.8 3.4 1.9 0.2 I. setosa 
5.0 3.0 1.6 0.2 I. setosa 
5.0 3.4 1.6 0.4 I. setosa 
5.2 3.5 1.5 0.2 I. setosa 
5.2 3.4 1.4 0.2 I. setosa 
4.7 3.2 1.6 0.2 I. setosa 
4.8 3.1 1.6 0.2 I. setosa 
5.4 3.4 1.5 0.4 I. setosa 
5.2 4.1 1.5 0.1 I. setosa 
5.5 4.2 1.4 0.2 I. setosa 
4.9 3.1 1.5 0.2 I. setosa 
5.0 3.2 1.2 0.2 I. setosa 
5.5 3.5 1.3 0.2 I. setosa 
4.9 3.6 1.4 0.1 I. setosa 
4.4 3.0 1.3 0.2 I. setosa 
5.1 3.4 1.5 0.2 I. setosa 
5.0 3.5 1.3 0.3 I. setosa 
4.5 2.3 1.3 0.3 I. setosa 
4.4 3.2 1.3 0.2 I. setosa 
5.0 3.5 1.6 0.6 I. setosa 
5.1 3.8 1.9 0.4 I. setosa 
4.8 3.0 1.4 0.3 I. setosa 
5.1 3.8 1.6 0.2 I. setosa 
4.6 3.2 1.4 0.2 I. setosa 
5.3 3.7 1.5 0.2 I. setosa 
5.0 3.3 1.4 0.2 I. setosa 
7.0 3.2 4.7 1.4 I. versicolor 
6.4 3.2 4.5 1.5 I. versicolor 
6.9 3.1 4.9 1.5 I. versicolor 
5.5 2.3 4.0 1.3 I. versicolor 
6.5 2.8 4.6 1.5 I. versicolor 
5.7 2.8 4.5 1.3 I. versicolor 
6.3 3.3 4.7 1.6 I. versicolor 
4.9 2.4 3.3 1.0 I. versicolor 
6.6 2.9 4.6 1.3 I. versicolor 
5.2 2.7 3.9 1.4 I. versicolor 
5.0 2.0 3.5 1.0 I. versicolor 
5.9 3.0 4.2 1.5 I. versicolor 
6.0 2.2 4.0 1.0 I. versicolor 
6.1 2.9 4.7 1.4 I. versicolor 
5.6 2.9 3.6 1.3 I. versicolor 
6.7 3.1 4.4 1.4 I. versicolor 
5.6 3.0 4.5 1.5 I. versicolor 
5.8 2.7 4.1 1.0 I. versicolor 
6.2 2.2 4.5 1.5 I. versicolor 
5.6 2.5 3.9 1.1 I. versicolor 
5.9 3.2 4.8 1.8 I. versicolor 
6.1 2.8 4.0 1.3 I. versicolor 
6.3 2.5 4.9 1.5 I. versicolor 
6.1 2.8 4.7 1.2 I. versicolor 
6.4 2.9 4.3 1.3 I. versicolor 
6.6 3.0 4.4 1.4 I. versicolor 
6.8 2.8 4.8 1.4 I. versicolor 
6.7 3.0 5.0 1.7 I. versicolor 
6.0 2.9 4.5 1.5 I. versicolor 
5.7 2.6 3.5 1.0 I. versicolor 
5.5 2.4 3.8 1.1 I. versicolor 
5.5 2.4 3.7 1.0 I. versicolor 
5.8 2.7 3.9 1.2 I. versicolor 
6.0 2.7 5.1 1.6 I. versicolor 
5.4 3.0 4.5 1.5 I. versicolor 
6.0 3.4 4.5 1.6 I. versicolor 
6.7 3.1 4.7 1.5 I. versicolor 
6.3 2.3 4.4 1.3 I. versicolor 
5.6 3.0 4.1 1.3 I. versicolor 
5.5 2.5 4.0 1.3 I. versicolor 
5.5 2.6 4.4 1.2 I. versicolor 
6.1 3.0 4.6 1.4 I. versicolor 
5.8 2.6 4.0 1.2 I. versicolor 
5.0 2.3 3.3 1.0 I. versicolor 
5.6 2.7 4.2 1.3 I. versicolor 
5.7 3.0 4.2 1.2 I. versicolor 
5.7 2.9 4.2 1.3 I. versicolor 
6.2 2.9 4.3 1.3 I. versicolor 
5.1 2.5 3.0 1.1 I. versicolor 
5.7 2.8 4.1 1.3 I. versicolor 
6.3 3.3 6.0 2.5 I. virginica 
5.8 2.7 5.1 1.9 I. virginica 
7.1 3.0 5.9 2.1 I. virginica 
6.3 2.9 5.6 1.8 I. virginica 
6.5 3.0 5.8 2.2 I. virginica 
7.6 3.0 6.6 2.1 I. virginica 
4.9 2.5 4.5 1.7 I. virginica 
7.3 2.9 6.3 1.8 I. virginica 
6.7 2.5 5.8 1.8 I. virginica 
7.2 3.6 6.1 2.5 I. virginica 
6.5 3.2 5.1 2.0 I. virginica 
6.4 2.7 5.3 1.9 I. virginica 
6.8 3.0 5.5 2.1 I. virginica 
5.7 2.5 5.0 2.0 I. virginica 
5.8 2.8 5.1 2.4 I. virginica 
6.4 3.2 5.3 2.3 I. virginica 
6.5 3.0 5.5 1.8 I. virginica 
7.7 3.8 6.7 2.2 I. virginica 
7.7 2.6 6.9 2.3 I. virginica 
6.0 2.2 5.0 1.5 I. virginica 
6.9 3.2 5.7 2.3 I. virginica 
5.6 2.8 4.9 2.0 I. virginica 
7.7 2.8 6.7 2.0 I. virginica 
6.3 2.7 4.9 1.8 I. virginica 
6.7 3.3 5.7 2.1 I. virginica 
7.2 3.2 6.0 1.8 I. virginica 
6.2 2.8 4.8 1.8 I. virginica 
6.1 3.0 4.9 1.8 I. virginica 
6.4 2.8 5.6 2.1 I. virginica 
7.2 3.0 5.8 1.6 I. virginica 
7.4 2.8 6.1 1.9 I. virginica 
7.9 3.8 6.4 2.0 I. virginica 
6.4 2.8 5.6 2.2 I. virginica 
6.3 2.8 5.1 1.5 I. virginica 
6.1 2.6 5.6 1.4 I. virginica 
7.7 3.0 6.1 2.3 I. virginica 
6.3 3.4 5.6 2.4 I. virginica 
6.4 3.1 5.5 1.8 I. virginica 
6.0 3.0 4.8 1.8 I. virginica 
6.9 3.1 5.4 2.1 I. virginica 
6.7 3.1 5.6 2.4 I. virginica 
6.9 3.1 5.1 2.3 I. virginica 
5.8 2.7 5.1 1.9 I. virginica 
6.8 3.2 5.9 2.3 I. virginica 
6.7 3.3 5.7 2.5 I. virginica 
6.7 3.0 5.2 2.3 I. virginica 
6.3 2.5 5.0 1.9 I. virginica 
6.5 3.0 5.2 2.0 I. virginica 
6.2 3.4 5.4 2.3 I. virginica 
5.9 3.0 5.1 1.8 I. virginica 
run;

proc sort data=iris  out=iris2;
  by descending sepallength sepalwidth;
run;

/*cluster the first data set */
proc fastclus data=iris maxc=2 maxiter=10 out=clus mean=my_clustercenters  noprint;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* append cluster centroids to second data set, renaming CLUSTER to the id var */ 
proc append base=iris2 data=my_clustercenters (rename=(cluster=id)) force;
run;

proc fastclus data=iris2 maxc=2 maxiter=10 out=clus2 noprint;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Find the centroids and set up the recode of new cluster results */
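data final_clus;
  /* cluster_lookup{new cluster id} = original cluster id, retained across iterations */
  array cluster_lookup {2} _temporary_;
  /* first iteration only: read the 2 appended centroid rows at the end of CLUS2 */
  if _n_=1 then do p=nrecs-1 to nrecs;
    set clus2 nobs=nrecs point=p;
    cluster_lookup{cluster}=id;
  end;
  set clus2;
  final_cluster=cluster_lookup{cluster};
  /* stop before the centroid rows themselves are written to FINAL_CLUS */
  if _n_>=nrecs-1 then stop;
run;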
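
To check that the recode worked, one option is to cross-tabulate the original cluster IDs against the recoded ones. This is a sketch that assumes the CLUS and FINAL_CLUS data sets and the ID variable created above; the names CLUS_SORTED, FINAL_SORTED, COMPARE, and ORIG_CLUSTER are hypothetical and introduced only for illustration.

```sas
/* Original assignments, keyed by id */
proc sort data=clus out=clus_sorted (keep=id cluster rename=(cluster=orig_cluster));
  by id;
run;

/* Recoded assignments from the second run, keyed by id */
proc sort data=final_clus out=final_sorted (keep=id final_cluster);
  by id;
run;

/* Put both assignments on one row per observation */
data compare;
  merge clus_sorted final_sorted;
  by id;
run;

/* Counts concentrated on the diagonal mean the recoded IDs line up with the originals */
proc freq data=compare;
  tables orig_cluster*final_cluster / norow nocol nopercent;
run;
```

Off-diagonal counts would indicate observations that changed cluster between the two runs, which the recode alone cannot fix.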

 

 

Valued Guide
Posts: 835

Re: Clustering in SAS

Actually, it also looks like the OUTSTAT= and INSTAT= options might work for you.  The documentation describes INSTAT= as follows:

 

The INSTAT= option reads a SAS data set previously created with the FASTCLUS procedure by using the OUTSTAT= option. If you specify the INSTAT= option, no clustering iterations are performed and no output is produced. Only cluster assignment and imputation are performed as an OUT= data set is created.

 

It seems to me you could use OUTSTAT= in the first FASTCLUS run, and pass that data set as INSTAT= to the subsequent FASTCLUS run.

 

proc fastclus data=my_orig_data outstat=orig_stats  .... other options ....;
  var ....;
run;

 

Then

 

proc fastclus data=my_new_data instat=orig_stats  .... other options ....;
  var ....;
run;

 

 

This option assumes your first data set was representative of all the subsequent data sets.
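
Applied to the iris example, the OUTSTAT=/INSTAT= approach might look like the sketch below. It assumes the IRIS data set from the earlier post; the names IRIS_LATER, IRIS_STATS, CLUS_FIRST, and CLUS_LATER are hypothetical, and the second step relies on the INSTAT= behavior as described in the documentation excerpt above rather than on a tested run.

```sas
/* A reordered copy of the data, standing in for a later period (hypothetical) */
proc sort data=iris out=iris_later;
  by descending SepalLength SepalWidth;
run;

/* First run: cluster as before and save the cluster statistics */
proc fastclus data=iris maxc=2 maxiter=10 outstat=iris_stats out=clus_first noprint;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Later run: no iterations, only assignment to the saved clusters */
proc fastclus data=iris_later instat=iris_stats out=clus_later;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Cluster sizes under the original numbering */
proc freq data=clus_later;
  tables cluster;
run;
```

Because no new clustering is performed in the second step, the cluster numbering is inherited from the first run rather than recomputed from the new data order.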

Occasional Contributor
Posts: 5

Re: Clustering in SAS

Fantastic!! thank you so much. I'll give this a try. 
