pacinoxl1212
Calcite | Level 5

Good afternoon, 

 

This is regarding clustering procedures in SAS, in this case PROC FASTCLUS. I'm clustering a portfolio of 1 million customers as of Jan 2016 based on profit, volume of transactions, and size of transactions. I found 6 clusters, numbered 1 to 6 in the output of my FASTCLUS procedure.

 

If I rerun my clustering on the Jan 2015 data, will each of the clusters retain the same underlying properties? For example, Cluster 1 in 2016 represents high-profit customers with high transaction volume and size. Will Cluster 1 in 2015 reflect the same (high profit, volume, and size), or will Cluster 1 be completely different?

 

For additional context, most of the customers remain the same between 2015 and 2016, and the three variables (profit, transaction volume, and transaction size) are constructed the same way in both periods.

 

Thank you.

5 REPLIES
mkeintz
PROC Star

I think you are saying that you expect the same 6 cluster attributes from the next dataset - presumably because the dropped and added customers are not from different underlying populations.  And you are asking whether the cluster id assignment would be the same.  Is that correct?

 

If so, the answer is no. I ran the FASTCLUS example from the SAS documentation against the Fisher iris data as found on Wikipedia. Then I sorted the data and reran FASTCLUS: I got the same clusters but with different IDs. In other words, data order matters.

 

You might get lucky and have the same cluster ID assignments, but there is no guarantee.

 

data iris;
  /* $13. rather than $12. so that "I. versicolor" (13 characters) is not truncated */
  input sepallength sepalwidth petallength petalwidth species $13.;
datalines;
5.1 3.5 1.4 0.2 I. setosa
4.9 3.0 1.4 0.2 I. setosa
4.7 3.2 1.3 0.2 I. setosa
4.6 3.1 1.5 0.2 I. setosa
5.0 3.6 1.4 0.3 I. setosa
5.4 3.9 1.7 0.4 I. setosa
4.6 3.4 1.4 0.3 I. setosa
5.0 3.4 1.5 0.2 I. setosa
4.4 2.9 1.4 0.2 I. setosa
4.9 3.1 1.5 0.1 I. setosa
5.4 3.7 1.5 0.2 I. setosa
4.8 3.4 1.6 0.2 I. setosa
4.8 3.0 1.4 0.1 I. setosa
4.3 3.0 1.1 0.1 I. setosa
5.8 4.0 1.2 0.2 I. setosa
5.7 4.4 1.5 0.4 I. setosa
5.4 3.9 1.3 0.4 I. setosa
5.1 3.5 1.4 0.3 I. setosa
5.7 3.8 1.7 0.3 I. setosa
5.1 3.8 1.5 0.3 I. setosa
5.4 3.4 1.7 0.2 I. setosa
5.1 3.7 1.5 0.4 I. setosa
4.6 3.6 1.0 0.2 I. setosa
5.1 3.3 1.7 0.5 I. setosa
4.8 3.4 1.9 0.2 I. setosa
5.0 3.0 1.6 0.2 I. setosa
5.0 3.4 1.6 0.4 I. setosa
5.2 3.5 1.5 0.2 I. setosa
5.2 3.4 1.4 0.2 I. setosa
4.7 3.2 1.6 0.2 I. setosa
4.8 3.1 1.6 0.2 I. setosa
5.4 3.4 1.5 0.4 I. setosa
5.2 4.1 1.5 0.1 I. setosa
5.5 4.2 1.4 0.2 I. setosa
4.9 3.1 1.5 0.2 I. setosa
5.0 3.2 1.2 0.2 I. setosa
5.5 3.5 1.3 0.2 I. setosa
4.9 3.6 1.4 0.1 I. setosa
4.4 3.0 1.3 0.2 I. setosa
5.1 3.4 1.5 0.2 I. setosa
5.0 3.5 1.3 0.3 I. setosa
4.5 2.3 1.3 0.3 I. setosa
4.4 3.2 1.3 0.2 I. setosa
5.0 3.5 1.6 0.6 I. setosa
5.1 3.8 1.9 0.4 I. setosa
4.8 3.0 1.4 0.3 I. setosa
5.1 3.8 1.6 0.2 I. setosa
4.6 3.2 1.4 0.2 I. setosa
5.3 3.7 1.5 0.2 I. setosa
5.0 3.3 1.4 0.2 I. setosa
7.0 3.2 4.7 1.4 I. versicolor
6.4 3.2 4.5 1.5 I. versicolor
6.9 3.1 4.9 1.5 I. versicolor
5.5 2.3 4.0 1.3 I. versicolor
6.5 2.8 4.6 1.5 I. versicolor
5.7 2.8 4.5 1.3 I. versicolor
6.3 3.3 4.7 1.6 I. versicolor
4.9 2.4 3.3 1.0 I. versicolor
6.6 2.9 4.6 1.3 I. versicolor
5.2 2.7 3.9 1.4 I. versicolor
5.0 2.0 3.5 1.0 I. versicolor
5.9 3.0 4.2 1.5 I. versicolor
6.0 2.2 4.0 1.0 I. versicolor
6.1 2.9 4.7 1.4 I. versicolor
5.6 2.9 3.6 1.3 I. versicolor
6.7 3.1 4.4 1.4 I. versicolor
5.6 3.0 4.5 1.5 I. versicolor
5.8 2.7 4.1 1.0 I. versicolor
6.2 2.2 4.5 1.5 I. versicolor
5.6 2.5 3.9 1.1 I. versicolor
5.9 3.2 4.8 1.8 I. versicolor
6.1 2.8 4.0 1.3 I. versicolor
6.3 2.5 4.9 1.5 I. versicolor
6.1 2.8 4.7 1.2 I. versicolor
6.4 2.9 4.3 1.3 I. versicolor
6.6 3.0 4.4 1.4 I. versicolor
6.8 2.8 4.8 1.4 I. versicolor
6.7 3.0 5.0 1.7 I. versicolor
6.0 2.9 4.5 1.5 I. versicolor
5.7 2.6 3.5 1.0 I. versicolor
5.5 2.4 3.8 1.1 I. versicolor
5.5 2.4 3.7 1.0 I. versicolor
5.8 2.7 3.9 1.2 I. versicolor
6.0 2.7 5.1 1.6 I. versicolor
5.4 3.0 4.5 1.5 I. versicolor
6.0 3.4 4.5 1.6 I. versicolor
6.7 3.1 4.7 1.5 I. versicolor
6.3 2.3 4.4 1.3 I. versicolor
5.6 3.0 4.1 1.3 I. versicolor
5.5 2.5 4.0 1.3 I. versicolor
5.5 2.6 4.4 1.2 I. versicolor
6.1 3.0 4.6 1.4 I. versicolor
5.8 2.6 4.0 1.2 I. versicolor
5.0 2.3 3.3 1.0 I. versicolor
5.6 2.7 4.2 1.3 I. versicolor
5.7 3.0 4.2 1.2 I. versicolor
5.7 2.9 4.2 1.3 I. versicolor
6.2 2.9 4.3 1.3 I. versicolor
5.1 2.5 3.0 1.1 I. versicolor
5.7 2.8 4.1 1.3 I. versicolor
6.3 3.3 6.0 2.5 I. virginica
5.8 2.7 5.1 1.9 I. virginica
7.1 3.0 5.9 2.1 I. virginica
6.3 2.9 5.6 1.8 I. virginica
6.5 3.0 5.8 2.2 I. virginica
7.6 3.0 6.6 2.1 I. virginica
4.9 2.5 4.5 1.7 I. virginica
7.3 2.9 6.3 1.8 I. virginica
6.7 2.5 5.8 1.8 I. virginica
7.2 3.6 6.1 2.5 I. virginica
6.5 3.2 5.1 2.0 I. virginica
6.4 2.7 5.3 1.9 I. virginica
6.8 3.0 5.5 2.1 I. virginica
5.7 2.5 5.0 2.0 I. virginica
5.8 2.8 5.1 2.4 I. virginica
6.4 3.2 5.3 2.3 I. virginica
6.5 3.0 5.5 1.8 I. virginica
7.7 3.8 6.7 2.2 I. virginica
7.7 2.6 6.9 2.3 I. virginica
6.0 2.2 5.0 1.5 I. virginica
6.9 3.2 5.7 2.3 I. virginica
5.6 2.8 4.9 2.0 I. virginica
7.7 2.8 6.7 2.0 I. virginica
6.3 2.7 4.9 1.8 I. virginica
6.7 3.3 5.7 2.1 I. virginica
7.2 3.2 6.0 1.8 I. virginica
6.2 2.8 4.8 1.8 I. virginica
6.1 3.0 4.9 1.8 I. virginica
6.4 2.8 5.6 2.1 I. virginica
7.2 3.0 5.8 1.6 I. virginica
7.4 2.8 6.1 1.9 I. virginica
7.9 3.8 6.4 2.0 I. virginica
6.4 2.8 5.6 2.2 I. virginica
6.3 2.8 5.1 1.5 I. virginica
6.1 2.6 5.6 1.4 I. virginica
7.7 3.0 6.1 2.3 I. virginica
6.3 3.4 5.6 2.4 I. virginica
6.4 3.1 5.5 1.8 I. virginica
6.0 3.0 4.8 1.8 I. virginica
6.9 3.1 5.4 2.1 I. virginica
6.7 3.1 5.6 2.4 I. virginica
6.9 3.1 5.1 2.3 I. virginica
5.8 2.7 5.1 1.9 I. virginica
6.8 3.2 5.9 2.3 I. virginica
6.7 3.3 5.7 2.5 I. virginica
6.7 3.0 5.2 2.3 I. virginica
6.3 2.5 5.0 1.9 I. virginica
6.5 3.0 5.2 2.0 I. virginica
6.2 3.4 5.4 2.3 I. virginica
5.9 3.0 5.1 1.8 I. virginica
run;
/* First run: cluster the data in its original order */
proc fastclus data=iris maxc=2 maxiter=10 out=clus;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Re-order the observations, then cluster again with identical options */
proc sort data=iris;
  by descending sepallength sepalwidth;
run;
proc fastclus data=iris maxc=2 maxiter=10 out=clus2;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
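
One way to see the relabeling, sketched here on the assumption that CLUS and CLUS2 are the OUT= data sets produced by the two runs above, is to compare the per-cluster means. If the same groups were found under different numbers, the same rows of means should show up in both tables but against different CLUSTER values.

```sas
/* Per-cluster size and means from the first run (original order) */
proc means data=clus n mean maxdec=2;
  class cluster;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Same summary for the second run (sorted order); compare the rows, */
/* not the cluster numbers                                            */
proc means data=clus2 n mean maxdec=2;
  class cluster;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;
```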

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
pacinoxl1212
Calcite | Level 5

Thanks, Mark. I appreciate the response. That's exactly what I meant. Do you know of a way to retain the same properties? Is there a segmentation logic which I can leverage?

 

 

mkeintz
PROC Star

Why not make 6 synthetic customers from the first FASTCLUS, each with clustering vars at the centroid of the corresponding cluster? You can use the MEAN= option to output the cluster centers to a data set, as I've done with MY_CLUSTERCENTERS below.

 

Then include those 6 in the data for the second FASTCLUS (via PROC APPEND). Find them in the FASTCLUS results (they'll be the last 6 observations in the data set), and then you'll know how to recode your cluster IDs: the ID values of the appended observations are the original cluster IDs, and the CLUSTER values they receive are the new ones.

 

Edited addition: this all assumes sufficient stability in your changing data sets, such that you are confident that the nature of the 6 clusters does not change over time.

 

In the case of the IRIS data, with 2 clusters it would be this:

 

/* Make the first data set - if it doesn't have an ID var, make one */
data iris;
   title 'Fisher (1936) Iris Data';
   /* $13. so that "I. versicolor" (13 characters) is not truncated */
   input SepalLength SepalWidth PetalLength PetalWidth Species $13.;
   /* The values below are in cm (e.g. 5.1), not mm */
   label SepalLength='Sepal Length in cm.'
         SepalWidth ='Sepal Width in cm.'
         PetalLength='Petal Length in cm.'
         PetalWidth ='Petal Width in cm.';
   id=1000000+_n_;
datalines;
5.1 3.5 1.4 0.2 I. setosa 
4.9 3.0 1.4 0.2 I. setosa 
4.7 3.2 1.3 0.2 I. setosa 
4.6 3.1 1.5 0.2 I. setosa 
5.0 3.6 1.4 0.3 I. setosa 
5.4 3.9 1.7 0.4 I. setosa 
4.6 3.4 1.4 0.3 I. setosa 
5.0 3.4 1.5 0.2 I. setosa 
4.4 2.9 1.4 0.2 I. setosa 
4.9 3.1 1.5 0.1 I. setosa 
5.4 3.7 1.5 0.2 I. setosa 
4.8 3.4 1.6 0.2 I. setosa 
4.8 3.0 1.4 0.1 I. setosa 
4.3 3.0 1.1 0.1 I. setosa 
5.8 4.0 1.2 0.2 I. setosa 
5.7 4.4 1.5 0.4 I. setosa 
5.4 3.9 1.3 0.4 I. setosa 
5.1 3.5 1.4 0.3 I. setosa 
5.7 3.8 1.7 0.3 I. setosa 
5.1 3.8 1.5 0.3 I. setosa 
5.4 3.4 1.7 0.2 I. setosa 
5.1 3.7 1.5 0.4 I. setosa 
4.6 3.6 1.0 0.2 I. setosa 
5.1 3.3 1.7 0.5 I. setosa 
4.8 3.4 1.9 0.2 I. setosa 
5.0 3.0 1.6 0.2 I. setosa 
5.0 3.4 1.6 0.4 I. setosa 
5.2 3.5 1.5 0.2 I. setosa 
5.2 3.4 1.4 0.2 I. setosa 
4.7 3.2 1.6 0.2 I. setosa 
4.8 3.1 1.6 0.2 I. setosa 
5.4 3.4 1.5 0.4 I. setosa 
5.2 4.1 1.5 0.1 I. setosa 
5.5 4.2 1.4 0.2 I. setosa 
4.9 3.1 1.5 0.2 I. setosa 
5.0 3.2 1.2 0.2 I. setosa 
5.5 3.5 1.3 0.2 I. setosa 
4.9 3.6 1.4 0.1 I. setosa 
4.4 3.0 1.3 0.2 I. setosa 
5.1 3.4 1.5 0.2 I. setosa 
5.0 3.5 1.3 0.3 I. setosa 
4.5 2.3 1.3 0.3 I. setosa 
4.4 3.2 1.3 0.2 I. setosa 
5.0 3.5 1.6 0.6 I. setosa 
5.1 3.8 1.9 0.4 I. setosa 
4.8 3.0 1.4 0.3 I. setosa 
5.1 3.8 1.6 0.2 I. setosa 
4.6 3.2 1.4 0.2 I. setosa 
5.3 3.7 1.5 0.2 I. setosa 
5.0 3.3 1.4 0.2 I. setosa 
7.0 3.2 4.7 1.4 I. versicolor 
6.4 3.2 4.5 1.5 I. versicolor 
6.9 3.1 4.9 1.5 I. versicolor 
5.5 2.3 4.0 1.3 I. versicolor 
6.5 2.8 4.6 1.5 I. versicolor 
5.7 2.8 4.5 1.3 I. versicolor 
6.3 3.3 4.7 1.6 I. versicolor 
4.9 2.4 3.3 1.0 I. versicolor 
6.6 2.9 4.6 1.3 I. versicolor 
5.2 2.7 3.9 1.4 I. versicolor 
5.0 2.0 3.5 1.0 I. versicolor 
5.9 3.0 4.2 1.5 I. versicolor 
6.0 2.2 4.0 1.0 I. versicolor 
6.1 2.9 4.7 1.4 I. versicolor 
5.6 2.9 3.6 1.3 I. versicolor 
6.7 3.1 4.4 1.4 I. versicolor 
5.6 3.0 4.5 1.5 I. versicolor 
5.8 2.7 4.1 1.0 I. versicolor 
6.2 2.2 4.5 1.5 I. versicolor 
5.6 2.5 3.9 1.1 I. versicolor 
5.9 3.2 4.8 1.8 I. versicolor 
6.1 2.8 4.0 1.3 I. versicolor 
6.3 2.5 4.9 1.5 I. versicolor 
6.1 2.8 4.7 1.2 I. versicolor 
6.4 2.9 4.3 1.3 I. versicolor 
6.6 3.0 4.4 1.4 I. versicolor 
6.8 2.8 4.8 1.4 I. versicolor 
6.7 3.0 5.0 1.7 I. versicolor 
6.0 2.9 4.5 1.5 I. versicolor 
5.7 2.6 3.5 1.0 I. versicolor 
5.5 2.4 3.8 1.1 I. versicolor 
5.5 2.4 3.7 1.0 I. versicolor 
5.8 2.7 3.9 1.2 I. versicolor 
6.0 2.7 5.1 1.6 I. versicolor 
5.4 3.0 4.5 1.5 I. versicolor 
6.0 3.4 4.5 1.6 I. versicolor 
6.7 3.1 4.7 1.5 I. versicolor 
6.3 2.3 4.4 1.3 I. versicolor 
5.6 3.0 4.1 1.3 I. versicolor 
5.5 2.5 4.0 1.3 I. versicolor 
5.5 2.6 4.4 1.2 I. versicolor 
6.1 3.0 4.6 1.4 I. versicolor 
5.8 2.6 4.0 1.2 I. versicolor 
5.0 2.3 3.3 1.0 I. versicolor 
5.6 2.7 4.2 1.3 I. versicolor 
5.7 3.0 4.2 1.2 I. versicolor 
5.7 2.9 4.2 1.3 I. versicolor 
6.2 2.9 4.3 1.3 I. versicolor 
5.1 2.5 3.0 1.1 I. versicolor 
5.7 2.8 4.1 1.3 I. versicolor 
6.3 3.3 6.0 2.5 I. virginica 
5.8 2.7 5.1 1.9 I. virginica 
7.1 3.0 5.9 2.1 I. virginica 
6.3 2.9 5.6 1.8 I. virginica 
6.5 3.0 5.8 2.2 I. virginica 
7.6 3.0 6.6 2.1 I. virginica 
4.9 2.5 4.5 1.7 I. virginica 
7.3 2.9 6.3 1.8 I. virginica 
6.7 2.5 5.8 1.8 I. virginica 
7.2 3.6 6.1 2.5 I. virginica 
6.5 3.2 5.1 2.0 I. virginica 
6.4 2.7 5.3 1.9 I. virginica 
6.8 3.0 5.5 2.1 I. virginica 
5.7 2.5 5.0 2.0 I. virginica 
5.8 2.8 5.1 2.4 I. virginica 
6.4 3.2 5.3 2.3 I. virginica 
6.5 3.0 5.5 1.8 I. virginica 
7.7 3.8 6.7 2.2 I. virginica 
7.7 2.6 6.9 2.3 I. virginica 
6.0 2.2 5.0 1.5 I. virginica 
6.9 3.2 5.7 2.3 I. virginica 
5.6 2.8 4.9 2.0 I. virginica 
7.7 2.8 6.7 2.0 I. virginica 
6.3 2.7 4.9 1.8 I. virginica 
6.7 3.3 5.7 2.1 I. virginica 
7.2 3.2 6.0 1.8 I. virginica 
6.2 2.8 4.8 1.8 I. virginica 
6.1 3.0 4.9 1.8 I. virginica 
6.4 2.8 5.6 2.1 I. virginica 
7.2 3.0 5.8 1.6 I. virginica 
7.4 2.8 6.1 1.9 I. virginica 
7.9 3.8 6.4 2.0 I. virginica 
6.4 2.8 5.6 2.2 I. virginica 
6.3 2.8 5.1 1.5 I. virginica 
6.1 2.6 5.6 1.4 I. virginica 
7.7 3.0 6.1 2.3 I. virginica 
6.3 3.4 5.6 2.4 I. virginica 
6.4 3.1 5.5 1.8 I. virginica 
6.0 3.0 4.8 1.8 I. virginica 
6.9 3.1 5.4 2.1 I. virginica 
6.7 3.1 5.6 2.4 I. virginica 
6.9 3.1 5.1 2.3 I. virginica 
5.8 2.7 5.1 1.9 I. virginica 
6.8 3.2 5.9 2.3 I. virginica 
6.7 3.3 5.7 2.5 I. virginica 
6.7 3.0 5.2 2.3 I. virginica 
6.3 2.5 5.0 1.9 I. virginica 
6.5 3.0 5.2 2.0 I. virginica 
6.2 3.4 5.4 2.3 I. virginica 
5.9 3.0 5.1 1.8 I. virginica 
run;

proc sort data=iris  out=iris2;
  by descending sepallength sepalwidth;
run;

/*cluster the first data set */
proc fastclus data=iris maxc=2 maxiter=10 out=clus mean=my_clustercenters  noprint;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* append cluster centroids to second data set, renaming CLUSTER to the id var */ 
proc append base=iris2 data=my_clustercenters (rename=(cluster=id)) force;
run;

proc fastclus data=iris2 maxc=2 maxiter=10 out=clus2 noprint;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

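/* Find the centroids and set up the recode of new cluster results */
data final_clus;
  array cluster_lookup {2} _temporary_;
  /* On the first iteration, read the last 2 obs (the appended centroids): */
  /* CLUSTER is the new cluster number, ID is the original cluster number  */
  if _n_=1 then do p=nrecs-1 to nrecs;
    set clus2 nobs=nrecs point=p; 
    cluster_lookup{cluster}=id; 
  end;
  set clus2;
  /* Translate the new cluster number back to the original numbering */
  final_cluster=cluster_lookup{cluster};
  /* Stop at the appended centroids so they are not written out */
  if _n_>=nrecs-1 then stop;
run;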
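
A quick sanity check on the recode, offered as a sketch that assumes FINAL_CLUS was built by the step above: cross-tabulate the new cluster number against the recoded one. Each new CLUSTER value should map to exactly one FINAL_CLUSTER value. The MISSING option is there so that a missing FINAL_CLUSTER shows up in the table, which would happen if two appended centroids landed in the same new cluster and one lookup entry was never filled.

```sas
/* Each row (new cluster number) should have counts in only one column */
/* (original cluster number)                                            */
proc freq data=final_clus;
  tables cluster*final_cluster / norow nocol nopercent missing;
run;
```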

 

 

mkeintz
PROC Star

Actually, it also looks like the OUTSTAT= and INSTAT= options might work for you. The documentation describes INSTAT= as follows:

 

The INSTAT= option reads a SAS data set previously created with the FASTCLUS procedure by using the OUTSTAT= option. If you specify the INSTAT= option, no clustering iterations are performed and no output is produced. Only cluster assignment and imputation are performed as an OUT= data set is created.

 

It seems to me you could use OUTSTAT= in the first fastclus, and use it as INSTAT for the subsequent fastclus.

 

proc fastclus data=my_orig_data outstat=orig_stats  .... other options ....;
  var ....;
run;

 

Then

 

proc fastclus data=my_new_data instat=orig_stats  .... other options ....;
  var ....;
run;

 

 

This option assumes your first data set was representative of all the subsequent data sets.
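
For a concrete version of the OUTSTAT=/INSTAT= idea, here is a minimal sketch rather than a tested recipe. It assumes the SASHELP.IRIS sample data set that ships with SAS. The data set names IRIS_A, IRIS_B, CLUS_A, CLUS_B, COMPARE and the variables CLUSTER_A, CLUSTER_B are made up for illustration. MAXC= is repeated on the scoring call on the assumption that it is harmless next to INSTAT=. The re-sorted copy stands in for the "next period" data.

```sas
/* Working copy with a row id so the two sets of assignments can be matched */
data iris_a;
  set sashelp.iris;
  id = _n_;
run;

/* Same rows in a different order, standing in for the later data set */
proc sort data=iris_a out=iris_b;
  by descending SepalLength SepalWidth;
run;

/* Fit once: save the cluster statistics and the original assignments */
proc fastclus data=iris_a maxc=2 maxiter=10 outstat=orig_stats
              out=clus_a(keep=id cluster rename=(cluster=cluster_a)) noprint;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Score with the saved statistics: assignment only, no new iterations */
proc fastclus data=iris_b maxc=2 instat=orig_stats
              out=clus_b(keep=id cluster rename=(cluster=cluster_b));
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Line the two assignments up by id and compare them */
proc sort data=clus_a; by id; run;
proc sort data=clus_b; by id; run;

data compare;
  merge clus_a clus_b;
  by id;
run;

proc freq data=compare;
  tables cluster_a*cluster_b / norow nocol nopercent;
run;
```

If the saved centers reproduce the original assignment, the counts in the cross-tabulation should sit on the diagonal, meaning the cluster numbers carry the same meaning in both data sets regardless of row order.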

pacinoxl1212
Calcite | Level 5

Fantastic!! thank you so much. I'll give this a try. 

