Hi all,
I have a dataset of 1M+ store card account holders with five numeric variables: spend, transactions, tenure, calls, and app.
I also have which card each account holder has (e.g. Premium Card). There are 30 different cards in my dataset (I take a random sample of 100 account holders per card to make a sample dataset of 3,000 observations).
My task is to find which cards, rather than individual account holders, are similar to each other along those five numeric metrics. Since the dataset has mixed data types, I understand I need to use PROC DISTANCE (with METHOD=DGOWER) to create a distance matrix to feed into PROC CLUSTER.
My problem is this: I would like to retain as much data as possible, and therefore not first compute summary statistics (mean or median) of those five metrics by card. But when I cluster the raw data, I get as many clusters as there are account holders in my sample, rather than as many as there are cards. Is there a way to cluster at the card level while still retaining account-holder-level data, so that the maximum number of clusters equals the number of cards (30)?
My code...
/* Note: PROC SURVEYSELECT with STRATA requires the input sorted by the strata variable */
proc sort data=dataset;
   by card;
run;

proc surveyselect data=dataset sampsize=100 out=sample;
   strata card;
run;

proc distance data=sample method=dgower out=data_dgower;
   var interval(spend transactions tenure calls app);
   copy account_id card;
run;
NOTE: The standardization method is set to RANGE for interval and ordinal variables when METHOD=
GOWER or DGOWER is specified.
NOTE: The data set WORK.DATA_DGOWER has 3000 observations and 3002 variables.
proc cluster data=data_dgower method=ward outtree=tree_data noprint;
   copy account_id card;
   id card;
run;
NOTE: The data set WORK.TREE_DATA has 5999 observations and 13 variables.
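For reference, the usual workaround, which I would like to avoid because it throws away the within-card variation, would be to aggregate to card level first and then cluster the 30 card profiles. A sketch (variable names as in my code above):

```sas
/* Workaround sketch: one summary row per card, then cluster the 30 profiles */
proc means data=sample noprint nway;
   class card;
   var spend transactions tenure calls app;
   output out=card_means mean=;
run;

proc distance data=card_means method=dgower out=card_dgower;
   var interval(spend transactions tenure calls app);
   copy card;
run;

proc cluster data=card_dgower method=ward outtree=card_tree;
   id card;
run;
```

This yields a dendrogram with exactly 30 leaves, but each card is reduced to a single mean vector, which is exactly the information loss I'm trying to avoid.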
With correspondence analysis you could use all of your data. But you would have to define categories for all your continuous variables so that the procedure can transform your data into a contingency table.
Here is an example mimicking your problem, with only 3 kinds of cards (major, prepaid, and store):
/* Fake random data */
data test;
   call streaminit(879767);
   do acc = 1 to 3000;
      /* card type 1, 2, or 3 with probabilities 0.4 / 0.4 / 0.2 */
      cardType = rand("table", 0.4, 0.4);
      /* app = 1 with probability 0.4, else 2 */
      app = rand("table", 0.4);
      tenure = rand("poisson", 10);
      calls = rand("poisson", 3);
      if cardType = 1 then do;
         transac = rand("poisson", 10);
         spend = rand("lognormal", log(100));
      end;
      else if cardType = 2 then do;
         transac = rand("poisson", 20);
         spend = rand("lognormal", log(1000));
      end;
      else do;
         transac = rand("poisson", 50);
         spend = rand("lognormal", log(2000));
      end;
      output;
   end;
run;
/* Use formats to define categories */
proc format;
   value tenure
      0-12    = "new card"
      13-36   = "mid time card"
      37-high = "long time card";
   value calls
      0-2    = "few calls"
      3-5    = "mid calls"
      6-high = "lots of calls";
   value transac
      0-9     = "few transact"
      10-49   = "mid transact"
      50-high = "many transact";
   /* spend is continuous, so the boundaries must be exclusive
      (0-100, 100-1000, ... would overlap at 100 and 1000) */
   value spend
      0-100      = "low spend"
      100<-1000  = "mid spend"
      1000<-high = "high spend";
   value cardType
      1 = "major"
      2 = "store"
      3 = "prepaid";
   value app
      1 = "App"
      2 = "No App";
run;
/* Perform simple correspondence analysis */
proc corresp data=test;
   format cardType cardType. app app. tenure tenure. calls calls. transac transac. spend spend.;
   tables cardType, app tenure calls transac spend;
run;
Thanks for your reply, PGStats, though I can't see how PROC CORRESP helps.
OK, the example only involves 3 card types, so it doesn't show any clustering at all. Now, imagine there are three card types with 10 cards of each type. I expanded the example with such cards (identified by variable cardId):
/* Fake random data */
data test;
   call streaminit(879767);
   do acc = 1 to 30000;
      /* card type 1, 2, or 3 with probabilities 0.4 / 0.4 / 0.2 */
      cardType = rand("table", 0.4, 0.4);
      /* 10 cards of each type: 1-10 major, 11-20 store, 21-30 prepaid */
      cardId = ((cardType - 1) * 10) + rand("integer", 10);
      /* app = 1 with probability 0.4, else 2 */
      app = rand("table", 0.4);
      tenure = rand("poisson", 10);
      calls = rand("poisson", 3);
      if cardType = 1 then do;
         transac = rand("poisson", 10);
         spend = rand("lognormal", log(100));
      end;
      else if cardType = 2 then do;
         transac = rand("poisson", 20);
         spend = rand("lognormal", log(1000));
      end;
      else do;
         transac = rand("poisson", 50);
         spend = rand("lognormal", log(2000));
      end;
      output;
   end;
run;
/* Use formats to define categories */
proc format;
   value tenure
      0-12    = "new card"
      13-36   = "mid time card"
      37-high = "long time card";
   value calls
      0-2    = "few calls"
      3-5    = "mid calls"
      6-high = "lots of calls";
   value transac
      0-9     = "few transact"
      10-49   = "mid transact"
      50-high = "many transact";
   /* spend is continuous, so the boundaries must be exclusive
      (0-100, 100-1000, ... would overlap at 100 and 1000) */
   value spend
      0-100      = "low spend"
      100<-1000  = "mid spend"
      1000<-high = "high spend";
   value app
      1 = "App"
      2 = "No App";
   value cardType
      1 = "major"
      2 = "store"
      3 = "prepaid";
   value cardId
      1  = "Major 01"
      2  = "Major 02"
      3  = "Major 03"
      4  = "Major 04"
      5  = "Major 05"
      6  = "Major 06"
      7  = "Major 07"
      8  = "Major 08"
      9  = "Major 09"
      10 = "Major 10"
      11 = "Store 11"
      12 = "Store 12"
      13 = "Store 13"
      14 = "Store 14"
      15 = "Store 15"
      16 = "Store 16"
      17 = "Store 17"
      18 = "Store 18"
      19 = "Store 19"
      20 = "Store 20"
      21 = "Prepaid 21"
      22 = "Prepaid 22"
      23 = "Prepaid 23"
      24 = "Prepaid 24"
      25 = "Prepaid 25"
      26 = "Prepaid 26"
      27 = "Prepaid 27"
      28 = "Prepaid 28"
      29 = "Prepaid 29"
      30 = "Prepaid 30";
run;
/* Perform simple correspondence analysis */
proc corresp data=test;
   format cardId cardId. app app. tenure tenure. calls calls. transac transac. spend spend.;
   tables cardId, app tenure calls transac spend;
run;
Now I think you can see how the clustering of cards by card type shows up on the graph, and how the angular proximity (measured from the origin) of the explanatory categories indicates their relationship with the clusters.
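If you need explicit card clusters rather than a visual grouping, one option (my own suggestion, not something from the original exchange) is to save the row coordinates with the OUTC= option and run PROC CLUSTER on those 30 points:

```sas
/* Save the correspondence analysis coordinates */
proc corresp data=test outc=coords;
   format cardId cardId. app app. tenure tenure. calls calls. transac transac. spend spend.;
   tables cardId, app tenure calls transac spend;
run;

/* Keep only the 30 row (cardId) points: _TYPE_ = "OBS" */
data cardPoints;
   set coords;
   where _type_ = "OBS";
   keep _name_ dim1 dim2;
run;

/* Cluster the cards in the correspondence analysis space */
proc cluster data=cardPoints method=ward outtree=cardTree noprint;
   var dim1 dim2;
   id _name_;
run;

proc tree data=cardTree nclusters=3 out=cardClusters;
   id _name_;
run;
```

This keeps all account holders in the analysis (they determine the card profiles in the contingency table) while the clustering itself runs on only 30 card-level points.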
Correspondence analysis is not yet very popular in the USA, but it has long been used in France, Japan, and elsewhere in the world under different names, especially in marketing research.