Greetings,
First post on SAS Communities!
Have a dataset with Customer ID, PURCHASE DATE and PRODUCT CODE. A customer can have 1 or many products, purchased over any time period. I am trying to figure out the most popular (or most common) combination of products purchased by customers. It doesn't necessarily have to be a combination of 2 products. Can be 2 or 3 or 4 or any number of products based on frequency.
For example, customer 416002285459214 purchased products PROD17 and PROD18. Same is true for customer 416039285459214 but this customer also purchased PROD02.
I'd want to show total frequency of customers who purchased PROD17 and PROD18 and then separately show frequency of customers who purchased PROD17, PROD18 and PROD02.
In the end, I'd want to know the top 5 or top 10 combinations whether they're 2-product or 3-product or 4-product combinations etc. Hope I've clearly explained the problem at hand and definitely hope the sample data would help.
BTW, my actual data has millions of customers with hundreds of products.
Thanks everyone.
Would've liked to attach a file, but surprisingly the attachments are limited to 1K. I'm not sure how much data one can get stored on a 1K file.
Customer_ID PURCHASE_DATE PRODUCT_CODE
4160012851159850 04DEC2014 PROD14
4160012851159850 18OCT2014 PROD17
416002285459214 18OCT2014 PROD17
416002285459214 11JAN2015 PROD18
416003285759484 20DEC2014 PROD13
416003285759484 11JAN2015 PROD14
4160042851059757 22FEB2015 PROD02
4160042851059757 18OCT2014 PROD26
4160042851059757 18MAY2015 PROD27
4160053041360036 18MAY2015 PROD02
4160053041360036 20OCT2014 PROD17
4160062851159850 04DEC2014 PROD04
4160062851159850 22FEB2015 PROD28
4160062851159850 03APR2014 PROD30
416007266559304 20OCT2014 PROD17
416007266559304 04DEC2014 PROD18
416008304959670 20OCT2014 PROD26
416008304959670 20OCT2014 PROD27
416009285859574 12SEP2014 PROD13
416009285859574 30APR2015 PROD14
416009285859574 03APR2014 PROD17
416010285859574 12SEP2014 PROD26
416010285859574 12SEP2014 PROD27
416011247659385 29JUN2014 PROD26
416011247659385 03APR2014 PROD27
4160123041360036 18MAY2015 PROD03
4160123041360036 20DEC2014 PROD28
4160123041360036 29JUN2014 PROD30
4160132661460123 30APR2015 PROD03
4160132661460123 29JUN2014 PROD28
416014285459214 18OCT2014 PROD26
416014285459214 18OCT2014 PROD27
416015304959670 20OCT2014 PROD25
416015304959670 12SEP2014 PROD28
416016304959670 20OCT2014 PROD06
416016304959670 03APR2014 PROD25
4160172851159850 04DEC2014 PROD25
4160172851159850 03APR2014 PROD26
4160302851159850 20OCT2014 PROD28
4160302851159850 18OCT2014 PROD30
416031266559304 20OCT2014 PROD28
416031266559304 03APR2014 PROD30
416032285459214 18OCT2014 PROD02
416032285459214 20OCT2014 PROD30
4160332851159850 04DEC2014 PROD03
4160332851159850 18OCT2014 PROD30
4160343041360036 18MAY2015 PROD03
4160343041360036 20DEC2014 PROD30
4160352661460123 30APR2015 PROD19
4160352661460123 03APR2014 PROD28
416036285459214 18OCT2014 PROD26
416036285459214 12SEP2014 PROD27
4160373041559124 03APR2014 PROD02
4160373041559124 20DEC2014 PROD13
416038285759484 20DEC2014 PROD13
416038285759484 03APR2014 PROD14
416038285759484 22FEB2015 PROD17
416039285459214 18OCT2014 PROD02
416039285459214 03APR2014 PROD17
416039285459214 11JAN2015 PROD18
4160403041559124 03APR2014 PROD13
4160403041559124 18OCT2014 PROD17
4160403041559124 11JAN2015 PROD19
4160412851059757 22FEB2015 PROD08
4160412851059757 18OCT2014 PROD28
4160422851159850 04DEC2014 PROD03
4160422851159850 03APR2014 PROD28
4160432851159850 04DEC2014 PROD03
4160432851159850 22FEB2015 PROD17
4160442851059757 22FEB2015 PROD26
4160442851059757 18OCT2014 PROD27
4160453041259946 11JAN2015 PROD19
4160453041259946 03APR2014 PROD28
4160463041360036 18MAY2015 PROD04
4160463041360036 03APR2014 PROD28
416047285459214 18OCT2014 PROD17
416047285459214 11JAN2015 PROD18
4160483041360036 18MAY2015 PROD04
4160483041360036 12SEP2014 PROD28
4160492851159850 04DEC2014 PROD03
4160492851159850 18OCT2014 PROD28
4160492851159850 03APR2014 PROD30
416050247659385 29JUN2014 PROD13
This type of analysis is called Market Basket Analysis (MBA) - if you search there are several solutions on here. LexJansen.com also has a paper which includes a macro for MBA in Base SAS
Do you have SAS Enterprise Miner? If in EM you can use the MBA node.
Related, and maybe useful, Stanford University is offering a MOOC in September , "Mining Massive Datasets", by professors Jure Leskovec, Anand Rajaraman and Jeff Ullman. (https://www.coursera.org/course/mmds) which is a survey of such methods. They cover this topic in chapter 6 of their book, offered free to download by publisher, links at http://www.mmds.org/.
See if this helps:
data have;
infile cards;
informat Customer_ID $20. PURCHASE_DATE date9. PRODUCT_CODE $6.;
format Customer_ID $20. PURCHASE_DATE date9. PRODUCT_CODE $6.;
input Customer_ID PURCHASE_DATE PRODUCT_CODE;
cards;
4160012851159850 04DEC2014 PROD14
4160012851159850 18OCT2014 PROD17
416002285459214 18OCT2014 PROD17
416002285459214 11JAN2015 PROD18
416003285759484 20DEC2014 PROD13
416003285759484 11JAN2015 PROD14
4160042851059757 22FEB2015 PROD02
4160042851059757 18OCT2014 PROD26
4160042851059757 18MAY2015 PROD27
4160053041360036 18MAY2015 PROD02
4160053041360036 20OCT2014 PROD17
4160062851159850 04DEC2014 PROD04
4160062851159850 22FEB2015 PROD28
4160062851159850 03APR2014 PROD30
416007266559304 20OCT2014 PROD17
416007266559304 04DEC2014 PROD18
416008304959670 20OCT2014 PROD26
416008304959670 20OCT2014 PROD27
416009285859574 12SEP2014 PROD13
416009285859574 30APR2015 PROD14
416009285859574 03APR2014 PROD17
416010285859574 12SEP2014 PROD26
416010285859574 12SEP2014 PROD27
416011247659385 29JUN2014 PROD26
416011247659385 03APR2014 PROD27
4160123041360036 18MAY2015 PROD03
4160123041360036 20DEC2014 PROD28
4160123041360036 29JUN2014 PROD30
4160132661460123 30APR2015 PROD03
4160132661460123 29JUN2014 PROD28
416014285459214 18OCT2014 PROD26
416014285459214 18OCT2014 PROD27
416015304959670 20OCT2014 PROD25
416015304959670 12SEP2014 PROD28
416016304959670 20OCT2014 PROD06
416016304959670 03APR2014 PROD25
4160172851159850 04DEC2014 PROD25
4160172851159850 03APR2014 PROD26
4160302851159850 20OCT2014 PROD28
4160302851159850 18OCT2014 PROD30
416031266559304 20OCT2014 PROD28
416031266559304 03APR2014 PROD30
416032285459214 18OCT2014 PROD02
416032285459214 20OCT2014 PROD30
4160332851159850 04DEC2014 PROD03
4160332851159850 18OCT2014 PROD30
4160343041360036 18MAY2015 PROD03
4160343041360036 20DEC2014 PROD30
4160352661460123 30APR2015 PROD19
4160352661460123 03APR2014 PROD28
416036285459214 18OCT2014 PROD26
416036285459214 12SEP2014 PROD27
4160373041559124 03APR2014 PROD02
4160373041559124 20DEC2014 PROD13
416038285759484 20DEC2014 PROD13
416038285759484 03APR2014 PROD14
416038285759484 22FEB2015 PROD17
416039285459214 18OCT2014 PROD02
416039285459214 03APR2014 PROD17
416039285459214 11JAN2015 PROD18
4160403041559124 03APR2014 PROD13
4160403041559124 18OCT2014 PROD17
4160403041559124 11JAN2015 PROD19
4160412851059757 22FEB2015 PROD08
4160412851059757 18OCT2014 PROD28
4160422851159850 04DEC2014 PROD03
4160422851159850 03APR2014 PROD28
4160432851159850 04DEC2014 PROD03
4160432851159850 22FEB2015 PROD17
4160442851059757 22FEB2015 PROD26
4160442851059757 18OCT2014 PROD27
4160453041259946 11JAN2015 PROD19
4160453041259946 03APR2014 PROD28
4160463041360036 18MAY2015 PROD04
4160463041360036 03APR2014 PROD28
416047285459214 18OCT2014 PROD17
416047285459214 11JAN2015 PROD18
4160483041360036 18MAY2015 PROD04
4160483041360036 12SEP2014 PROD28
4160492851159850 04DEC2014 PROD03
4160492851159850 18OCT2014 PROD28
4160492851159850 03APR2014 PROD30
416050247659385 29JUN2014 PROD13
;
proc sort data=have;by Customer_ID product_code;
proc transpose data=prep out=tran_prep (drop=_NAME_);by customer_id;var product_code;
proc sql;
create table want as
select distinct catx('-',col1,col2,col3) as Products,count(calculated products) as count
from tran_prep
group by products
order by count desc;
Thank you for the reply Mark Johnson.
For some reason, my SAS froze when I tried running the transpose procedure ( I did change the prep to have).
Having said that, I have already tried a similar approach. Instead of using var product_code, I transposed by id product_code and inserted a 1 where a customer had a product. So essentially, my dataset contained a column for Customer_ID followed by 30 columns for products, which had a 1 every time a customer had that product. I have more than 30 products but limited it to 30 for this method.
Then I concatenated all the product fields (after changing the missing values to 0) and ended up with a bunch of different combinations like this.
PRODUCT_COMBO | CUSTOMERS |
000000000000000000000000011000 | 21168 |
000000000000110000000000000000 | 10613 |
001000000000000000000000000001 | 10247 |
000000000000000000000000000101 | 9597 |
001000000000000000000000000100 | 5126 |
001000000000000010000000000000 | 4805 |
001000000000000000000000000101 | 3227 |
000000000000000001000000000100 | 2838 |
000000110000000000000000000000 | 2565 |
000000000000000011000000000000 | 2500 |
010000000000000010000000000000 | 2116 |
000000000000000000100000000100 | 2085 |
010000000000000000000000000100 | 1901 |
000000000000000010100000000000 | 1789 |
010000000000000000000000011000 | 1668 |
000100000000000000000000000100 | 1659 |
010000000000000000000000010000 | 1570 |
000100000000000010000000000000 | 1548 |
010000000000000000000000001000 | 1221 |
000000000000100010000000000000 | 932 |
010000000000000000000000000001 | 880 |
000000000000000000000000010100 | 840 |
010000000000110000000000000000 | 836 |
000000000000000001000000000001 | 829 |
000000000000010010000000000000 | 781 |
000000000000110010000000000000 | 766 |
010000000000100000000000000000 | 757 |
000000000000000000000000001100 | 737 |
000000000000000001000000000101 | 663 |
010000000000010000000000000000 | 630 |
000000000000000000000000011100 | 629 |
000010000000000000000000000010 | 616 |
000000000000000000000000010001 | 557 |
011000000000000000000000000000 | 555 |
000100000000000000000000000001 | 551 |
001000000000000001000000000000 | 537 |
000001000000000000000000100000 | 473 |
010000000000000000000000000101 | 471 |
000100000000000000000000000101 | 457 |
000000000000000000000000100100 | 451 |
011000000000000000000000000001 | 448 |
011000000000000010000000000000 | 439 |
001000000000000000000100000000 | 421 |
000000000000000110000000000000 | 402 |
000000000000000000000000001001 | 392 |
001000000000000001000000000100 | 386 |
001000000000000011000000000000 | 370 |
000000010000000010000000000000 | 336 |
000000000000000000100000000101 | 332 |
011000000000000000000000000100 | 320 |
001000000000000001000000000101 | 319 |
000000000000000000000000011001 | 287 |
000000000000000000100000000001 | 264 |
011000000000000000000000000101 | 258 |
010000000000000001000000000000 | 253 |
001000000000000001000000000001 | 232 |
000001000000000100000000000000 | 230 |
010000000000000011000000000000 | 216 |
000000000000000001000100000000 | 215 |
010000000000000000100000000000 | 184 |
000000010000000000000000000100 | 176 |
001000000000000000100000000100 | 174 |
010000000000000010100000000000 | 172 |
001000000000000000100000000000 | 170 |
010000000000000001000000000100 | 170 |
010000000000000000100000000100 | 164 |
010100000000000010000000000000 | 160 |
001000000000000010100000000000 | 156 |
000001010000000000000000000000 | 150 |
000001000000000000000000000100 | 145 |
000001000000000010000000000000 | 134 |
001100000000000000000000000000 | 133 |
000000010000000000000000000000 | 132 |
001100000000000000000000000100 | 124 |
010100000000000000000000000000 | 123 |
000000000000000000010100000000 | 117 |
000000000000000011100000000000 | 117 |
000000010000100000000000000000 | 116 |
000000000000000001100000000000 | 112 |
010100000000000000000000000100 | 112 |
000010000000001000000000000000 | 107 |
001100000000000010000000000000 | 106 |
000100000000000010100000000000 | 103 |
000000000000000000000000010101 | 97 |
001000000000000000100000000101 | 96 |
001100000000000000000000000001 | 96 |
000100000000000001000000000000 | 95 |
000000010000000000000000010000 | 94 |
001000000000000000000000010000 | 94 |
010000000000110010000000000000 | 94 |
000100000000000000100000000000 | 90 |
000000000000000000000000001101 | 89 |
000100000000000000100000000100 | 88 |
000100000000000011000000000000 | 88 |
000000000000000001100000000100 | 83 |
000100000000000001000000000100 | 83 |
001100000000000000000000000101 | 83 |
000000000000000000100000001000 | 82 |
000000000000100001000000000000 | 80 |
000000010000010000000000000000 | 80 |
Since my product fields were sorted, I know, for example, that 21,168 Customers had product 26 and 27 (the first row in the table above).
I'm not sure if this is the most efficient way of doing it though.
Yeah. I agree with @Astounding . You only need to check the combination with two or three members.
If a combination of 2 has the highest frequency , then that has the highest frequency in all the combinations .
Xia Keshan
One thing to keep in mind as you consider how to approach the problem: you only need to find the two-product combinations. The way you have posed the problem, if a three-product combination has the highest frequency, there will be three two-product combinations that have the same frequency.
Once you get past that hurdle, you could still use your approach of constructing a long string of 0s and 1s. But instead of outputting that when you hit the end of a customer's records, you could increment counters for each two-level combination that is indicated by the long string.
Good luck.
Sorry, I'm not following your suggestion. Let's take a look at the frequency table again with string of 0s and 1s to indicate product combinations.
The flaw or the limitation of this approach is that it treats every single combination separately. Whereas I would want to combine the 2 rows in red. Row 4 tells us that 9,597 customers have products 28 and 30. Row 7 tells us that a 3,227 customers have products 3, 28 and 30. So if I wanted to know how many customers had products 28 and 30, I have to add these 2 rows. There may be other rows with same product combination e.g. there may be X number of customers who purchased product 1, 28 and 30 etc. etc.
PRODUCT_COMBO | CUSTOMERS |
000000000000000000000000011000 | 21168 |
000000000000110000000000000000 | 10613 |
001000000000000000000000000001 | 10247 |
000000000000000000000000000101 | 9597 |
001000000000000000000000000100 | 5126 |
001000000000000010000000000000 | 4805 |
001000000000000000000000000101 | 3227 |
000000000000000001000000000100 | 2838 |
000000110000000000000000000000 | 2565 |
000000000000000011000000000000 | 2500 |
010000000000000010000000000000 | 2116 |
000000000000000000100000000100 | 2085 |
010000000000000000000000000100 | 1901 |
000000000000000010100000000000 | 1789 |
010000000000000000000000011000 | 1668 |
000100000000000000000000000100 | 1659 |
010000000000000000000000010000 | 1570 |
000100000000000010000000000000 | 1548 |
010000000000000000000000001000 | 1221 |
000000000000100010000000000000 | 932 |
010000000000000000000000000001 | 880 |
000000000000000000000000010100 | 840 |
010000000000110000000000000000 | 836 |
000000000000000001000000000001 | 829 |
000000000000010010000000000000 | 781 |
000000000000110010000000000000 | 766 |
010000000000100000000000000000 | 757 |
000000000000000000000000001100 | 737 |
000000000000000001000000000101 | 663 |
010000000000010000000000000000 | 630 |
000000000000000000000000011100 | 629 |
000010000000000000000000000010 | 616 |
000000000000000000000000010001 | 557 |
011000000000000000000000000000 | 555 |
000100000000000000000000000001 | 551 |
001000000000000001000000000000 | 537 |
000001000000000000000000100000 | 473 |
010000000000000000000000000101 | 471 |
000100000000000000000000000101 | 457 |
000000000000000000000000100100 | 451 |
011000000000000000000000000001 | 448 |
011000000000000010000000000000 | 439 |
001000000000000000000100000000 | 421 |
000000000000000110000000000000 | 402 |
000000000000000000000000001001 | 392 |
001000000000000001000000000100 | 386 |
001000000000000011000000000000 | 370 |
000000010000000010000000000000 | 336 |
000000000000000000100000000101 | 332 |
011000000000000000000000000100 | 320 |
001000000000000001000000000101 | 319 |
000000000000000000000000011001 | 287 |
000000000000000000100000000001 | 264 |
011000000000000000000000000101 | 258 |
010000000000000001000000000000 | 253 |
001000000000000001000000000001 | 232 |
000001000000000100000000000000 | 230 |
010000000000000011000000000000 | 216 |
000000000000000001000100000000 | 215 |
010000000000000000100000000000 | 184 |
000000010000000000000000000100 | 176 |
001000000000000000100000000100 | 174 |
010000000000000010100000000000 | 172 |
001000000000000000100000000000 | 170 |
010000000000000001000000000100 | 170 |
010000000000000000100000000100 | 164 |
010100000000000010000000000000 | 160 |
001000000000000010100000000000 | 156 |
000001010000000000000000000000 | 150 |
000001000000000000000000000100 | 145 |
000001000000000010000000000000 | 134 |
001100000000000000000000000000 | 133 |
000000010000000000000000000000 | 132 |
001100000000000000000000000100 | 124 |
010100000000000000000000000000 | 123 |
000000000000000000010100000000 | 117 |
000000000000000011100000000000 | 117 |
000000010000100000000000000000 | 116 |
000000000000000001100000000000 | 112 |
010100000000000000000000000100 | 112 |
000010000000001000000000000000 | 107 |
001100000000000010000000000000 | 106 |
000100000000000010100000000000 | 103 |
000000000000000000000000010101 | 97 |
001000000000000000100000000101 | 96 |
001100000000000000000000000001 | 96 |
000100000000000001000000000000 | 95 |
000000010000000000000000010000 | 94 |
001000000000000000000000010000 | 94 |
010000000000110010000000000000 | 94 |
000100000000000000100000000000 | 90 |
000000000000000000000000001101 | 89 |
000100000000000000100000000100 | 88 |
000100000000000011000000000000 | 88 |
000000000000000001100000000100 | 83 |
000100000000000001000000000100 | 83 |
001100000000000000000000000101 | 83 |
000000000000000000100000001000 | 82 |
000000000000100001000000000000 | 80 |
000000010000010000000000000000 | 80 |
Just to give you one example, suppose products 3, 28, and 30 have the highest 3-way frequency. Then the 2-way combination of products 3 and 28 must have at least as high a frequency (same for the combination of products 3 and 30, and products 28 and 30).
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.