I have a table of items that were purchased by customers listed by customer IDs. There are 300 items and 2000 customers. The table looks like:
c_id | Item1 | Item2 | Item3 | ........ | Item300 |
c1 | 1 | 0 | 0 | ...... | 1 |
c2 | 0 | 1 | 1 | ....... | 0 |
... | ... | ... | ... | ... | ... |
c2000 | 1 | 0 | 0 | ... | 0 |
The 0s and 1s indicate which items were purchased by certain customer. e.g. c1 purchased items 1, (may other items between 4 and 299), and 300. My goal is to get the counts of customers that have different item purchasing combinations. e.g. a result table like:
Item_purchased | Customer_count |
Item1; Item5; Item6; Item300 | 45 |
Item3; Item200 | 20 |
... | ... |
Item4; Item100; Item299 | 5 |
So, the total of column Customer_count should be 2000. Since there are too many items, i cannot find a easy way to find those item combinations and put them in a table.
Not sure my question is clear or not. Can anyone help me with some ideas?
Thanks,
This should get you close to the expected table:
data combined;
set have;
length item_purchased $ 1000;
array items laptop -- keyboard;
do i = 1 to dim(items);
if items[i] then do;
item_purchased = catx(';', item_purchased, vname(items[i]));
end;
end;
keep item_purchased;
run;
proc freq data=combined;
table item_purchased / nocum nofreq nopercent;
run;
EDIT: should have reloaded the topic before posting, to avoid cloning what @ballardw already posted.
To make my question clear. Consider the following small size sample with 5 items and 10 customers:
data have;
input c_id $ laptop TV cellphone monitor keyboard;
cards;
c1 1 0 0 0 1
c2 1 0 0 0 0
c3 1 0 0 0 0
c4 0 0 0 0 1
c5 1 0 0 0 1
c6 1 0 0 0 0
c7 0 0 0 0 1
c8 1 0 1 1 0
c9 1 0 1 1 0
c10 1 0 0 0 0
;
run;
So the result will be similar to the table below. It can be in other format as long as I can easily get the count of customer and items purchased begin identified.
item_purchased | Customer_count |
laptop | 4 |
laptop; keyboard | 2 |
keyboard | 2 |
laptop; cellphone; monitor | 2 |
This should get you close to the expected table:
data combined;
set have;
length item_purchased $ 1000;
array items laptop -- keyboard;
do i = 1 to dim(items);
if items[i] then do;
item_purchased = catx(';', item_purchased, vname(items[i]));
end;
end;
keep item_purchased;
run;
proc freq data=combined;
table item_purchased / nocum nofreq nopercent;
run;
EDIT: should have reloaded the topic before posting, to avoid cloning what @ballardw already posted.
Thank you! That works!
Like this?
data KEY;
length KEY $300;
set HAVE;
KEY=cats(of ITEM:);
run;
proc freq data=KEY;
table KEY;
run;
This gives you the count by unique combination.
You can parse the KEY variable to replace the value by a list of item numbers if you want.
Just for fun, what will you do if NONE of the combinations are duplicated? You have 2 to the 300th power possible combinations, that is roughly 2E90, or 2 followed by 90 zeroes, and only 2000 customers. That would make it not uncommon to have no duplicates.
Slight modification of @ChrisNZ to get the variable names into the key.
data want; set have; array v(*) laptop -- keyboard; length key $10000; /* for your full job*/ do i=1 to dim(v); if v[i]=1 then key=catx(', ',key,vname(v[i])); end; run; proc freq data=want; tables key; run;
Vname is a function that will return the name of a variable (if used correctly).
Thank you!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.