Help using Base SAS procedures

Finding Most Common Combination

Reply
Occasional Contributor
Posts: 9

Finding Most Common Combination

Greetings,

First post on SAS Communities!

Have a dataset with Customer ID, PURCHASE DATE and PRODUCT CODE. A customer can have 1 or many products, purchased over any time period. I am trying to figure out the most popular (or most common) combination of products purchased by customers. It doesn't necessarily have to be a combination of 2 products. Can be 2 or 3 or 4 or any number of products based on frequency.

For example, customer 416002285459214 purchased products PROD17 and PROD18. Same is true for customer 416039285459214 but this customer also purchased PROD02.

I'd want to show total frequency of customers who purchased PROD17 and PROD18 and then separately show frequency of customers who purchased PROD17, PROD18 and PROD02.

In the end, I'd want to know the top 5 or top 10 combinations whether they're 2-product or 3-product or 4-product combinations etc. Hope I've clearly explained the problem at hand and definitely hope the sample data would help.

BTW, my actual data has millions of customers with hundreds of products.

Thanks everyone.

Would've liked to attach a file, but surprisingly the attachments are limited to 1K. I'm not sure how much data one can get stored on a 1K file.

Customer_ID PURCHASE_DATE PRODUCT_CODE

4160012851159850 04DEC2014 PROD14

4160012851159850 18OCT2014 PROD17

416002285459214 18OCT2014 PROD17

416002285459214 11JAN2015 PROD18

416003285759484 20DEC2014 PROD13

416003285759484 11JAN2015 PROD14

4160042851059757 22FEB2015 PROD02

4160042851059757 18OCT2014 PROD26

4160042851059757 18MAY2015 PROD27

4160053041360036 18MAY2015 PROD02

4160053041360036 20OCT2014 PROD17

4160062851159850 04DEC2014 PROD04

4160062851159850 22FEB2015 PROD28

4160062851159850 03APR2014 PROD30

416007266559304 20OCT2014 PROD17

416007266559304 04DEC2014 PROD18

416008304959670 20OCT2014 PROD26

416008304959670 20OCT2014 PROD27

416009285859574 12SEP2014 PROD13

416009285859574 30APR2015 PROD14

416009285859574 03APR2014 PROD17

416010285859574 12SEP2014 PROD26

416010285859574 12SEP2014 PROD27

416011247659385 29JUN2014 PROD26

416011247659385 03APR2014 PROD27

4160123041360036 18MAY2015 PROD03

4160123041360036 20DEC2014 PROD28

4160123041360036 29JUN2014 PROD30

4160132661460123 30APR2015 PROD03

4160132661460123 29JUN2014 PROD28

416014285459214 18OCT2014 PROD26

416014285459214 18OCT2014 PROD27

416015304959670 20OCT2014 PROD25

416015304959670 12SEP2014 PROD28

416016304959670 20OCT2014 PROD06

416016304959670 03APR2014 PROD25

4160172851159850 04DEC2014 PROD25

4160172851159850 03APR2014 PROD26

4160302851159850 20OCT2014 PROD28

4160302851159850 18OCT2014 PROD30

416031266559304 20OCT2014 PROD28

416031266559304 03APR2014 PROD30

416032285459214 18OCT2014 PROD02

416032285459214 20OCT2014 PROD30

4160332851159850 04DEC2014 PROD03

4160332851159850 18OCT2014 PROD30

4160343041360036 18MAY2015 PROD03

4160343041360036 20DEC2014 PROD30

4160352661460123 30APR2015 PROD19

4160352661460123 03APR2014 PROD28

416036285459214 18OCT2014 PROD26

416036285459214 12SEP2014 PROD27

4160373041559124 03APR2014 PROD02

4160373041559124 20DEC2014 PROD13

416038285759484 20DEC2014 PROD13

416038285759484 03APR2014 PROD14

416038285759484 22FEB2015 PROD17

416039285459214 18OCT2014 PROD02

416039285459214 03APR2014 PROD17

416039285459214 11JAN2015 PROD18

4160403041559124 03APR2014 PROD13

4160403041559124 18OCT2014 PROD17

4160403041559124 11JAN2015 PROD19

4160412851059757 22FEB2015 PROD08

4160412851059757 18OCT2014 PROD28

4160422851159850 04DEC2014 PROD03

4160422851159850 03APR2014 PROD28

4160432851159850 04DEC2014 PROD03

4160432851159850 22FEB2015 PROD17

4160442851059757 22FEB2015 PROD26

4160442851059757 18OCT2014 PROD27

4160453041259946 11JAN2015 PROD19

4160453041259946 03APR2014 PROD28

4160463041360036 18MAY2015 PROD04

4160463041360036 03APR2014 PROD28

416047285459214 18OCT2014 PROD17

416047285459214 11JAN2015 PROD18

4160483041360036 18MAY2015 PROD04

4160483041360036 12SEP2014 PROD28

4160492851159850 04DEC2014 PROD03

4160492851159850 18OCT2014 PROD28

4160492851159850 03APR2014 PROD30

416050247659385 29JUN2014 PROD13

Super User
Posts: 19,855

Re: Finding Most Common Combination

Posted in reply to BecomingSASsy

This type of analysis is called Market Basket Analysis (MBA) - if you search there are several solutions on here. LexJansen.com also has a paper which includes a macro for MBA in Base SAS

Do you have SAS Enterprise Miner? If in EM you can use the MBA node.

Occasional Contributor
Posts: 9

Re: Finding Most Common Combination

Thank you Reeza.

Did find a SUGI document on MBA.

http://www2.sas.com/proceedings/sugi28/223-28.pdf

Regular Contributor
Posts: 170

Re: Finding Most Common Combination

Posted in reply to BecomingSASsy

Related, and maybe useful, Stanford University is offering a MOOC in September , "Mining Massive Datasets", by professors Jure Leskovec, Anand Rajaraman and Jeff Ullman.  (https://www.coursera.org/course/mmds) which is a survey of such methods.  They cover this topic in chapter 6 of their book, offered free to download by publisher, links at http://www.mmds.org/.

Valued Guide
Posts: 860

Re: Finding Most Common Combination

Posted in reply to BecomingSASsy

See if this helps:

data have;

infile cards;

informat Customer_ID $20. PURCHASE_DATE date9. PRODUCT_CODE $6.;

format Customer_ID $20. PURCHASE_DATE date9. PRODUCT_CODE $6.;

input Customer_ID PURCHASE_DATE PRODUCT_CODE;

cards;

4160012851159850 04DEC2014 PROD14

4160012851159850 18OCT2014 PROD17

416002285459214 18OCT2014 PROD17

416002285459214 11JAN2015 PROD18

416003285759484 20DEC2014 PROD13

416003285759484 11JAN2015 PROD14

4160042851059757 22FEB2015 PROD02

4160042851059757 18OCT2014 PROD26

4160042851059757 18MAY2015 PROD27

4160053041360036 18MAY2015 PROD02

4160053041360036 20OCT2014 PROD17

4160062851159850 04DEC2014 PROD04

4160062851159850 22FEB2015 PROD28

4160062851159850 03APR2014 PROD30

416007266559304 20OCT2014 PROD17

416007266559304 04DEC2014 PROD18

416008304959670 20OCT2014 PROD26

416008304959670 20OCT2014 PROD27

416009285859574 12SEP2014 PROD13

416009285859574 30APR2015 PROD14

416009285859574 03APR2014 PROD17

416010285859574 12SEP2014 PROD26

416010285859574 12SEP2014 PROD27

416011247659385 29JUN2014 PROD26

416011247659385 03APR2014 PROD27

4160123041360036 18MAY2015 PROD03

4160123041360036 20DEC2014 PROD28

4160123041360036 29JUN2014 PROD30

4160132661460123 30APR2015 PROD03

4160132661460123 29JUN2014 PROD28

416014285459214 18OCT2014 PROD26

416014285459214 18OCT2014 PROD27

416015304959670 20OCT2014 PROD25

416015304959670 12SEP2014 PROD28

416016304959670 20OCT2014 PROD06

416016304959670 03APR2014 PROD25

4160172851159850 04DEC2014 PROD25

4160172851159850 03APR2014 PROD26

4160302851159850 20OCT2014 PROD28

4160302851159850 18OCT2014 PROD30

416031266559304 20OCT2014 PROD28

416031266559304 03APR2014 PROD30

416032285459214 18OCT2014 PROD02

416032285459214 20OCT2014 PROD30

4160332851159850 04DEC2014 PROD03

4160332851159850 18OCT2014 PROD30

4160343041360036 18MAY2015 PROD03

4160343041360036 20DEC2014 PROD30

4160352661460123 30APR2015 PROD19

4160352661460123 03APR2014 PROD28

416036285459214 18OCT2014 PROD26

416036285459214 12SEP2014 PROD27

4160373041559124 03APR2014 PROD02

4160373041559124 20DEC2014 PROD13

416038285759484 20DEC2014 PROD13

416038285759484 03APR2014 PROD14

416038285759484 22FEB2015 PROD17

416039285459214 18OCT2014 PROD02

416039285459214 03APR2014 PROD17

416039285459214 11JAN2015 PROD18

4160403041559124 03APR2014 PROD13

4160403041559124 18OCT2014 PROD17

4160403041559124 11JAN2015 PROD19

4160412851059757 22FEB2015 PROD08

4160412851059757 18OCT2014 PROD28

4160422851159850 04DEC2014 PROD03

4160422851159850 03APR2014 PROD28

4160432851159850 04DEC2014 PROD03

4160432851159850 22FEB2015 PROD17

4160442851059757 22FEB2015 PROD26

4160442851059757 18OCT2014 PROD27

4160453041259946 11JAN2015 PROD19

4160453041259946 03APR2014 PROD28

4160463041360036 18MAY2015 PROD04

4160463041360036 03APR2014 PROD28

416047285459214 18OCT2014 PROD17

416047285459214 11JAN2015 PROD18

4160483041360036 18MAY2015 PROD04

4160483041360036 12SEP2014 PROD28

4160492851159850 04DEC2014 PROD03

4160492851159850 18OCT2014 PROD28

4160492851159850 03APR2014 PROD30

416050247659385 29JUN2014 PROD13

;

proc sort data=have;by Customer_ID product_code;

proc transpose data=prep out=tran_prep (drop=_NAME_);by customer_id;var product_code;

proc sql;

create table want as

select distinct catx('-',col1,col2,col3) as Products,count(calculated products) as count

from tran_prep

group by products

order by count desc;

Occasional Contributor
Posts: 9

Re: Finding Most Common Combination

Posted in reply to Steelers_In_DC

Thank you for the reply Mark Johnson.

For some reason, my SAS froze when I tried running the transpose procedure ( I did change the prep to have).

Having said that, I have already tried a similar approach. Instead of using var product_code, I transposed by id product_code and inserted a 1 where a customer had a product. So essentially, my dataset contained a column for Customer_ID followed by 30 columns for products, which had a 1 every time a customer had that product. I have more than 30 products but limited it to 30 for this method.

Then I concatenated all the product fields (after changing the missing values to 0) and ended up with a bunch of different combinations like this.

PRODUCT_COMBOCUSTOMERS
00000000000000000000000001100021168
00000000000011000000000000000010613
00100000000000000000000000000110247
0000000000000000000000000001019597
0010000000000000000000000001005126
0010000000000000100000000000004805
0010000000000000000000000001013227
0000000000000000010000000001002838
0000001100000000000000000000002565
0000000000000000110000000000002500
0100000000000000100000000000002116
0000000000000000001000000001002085
0100000000000000000000000001001901
0000000000000000101000000000001789
0100000000000000000000000110001668
0001000000000000000000000001001659
0100000000000000000000000100001570
0001000000000000100000000000001548
0100000000000000000000000010001221
000000000000100010000000000000932
010000000000000000000000000001880
000000000000000000000000010100840
010000000000110000000000000000836
000000000000000001000000000001829
000000000000010010000000000000781
000000000000110010000000000000766
010000000000100000000000000000757
000000000000000000000000001100737
000000000000000001000000000101663
010000000000010000000000000000630
000000000000000000000000011100629
000010000000000000000000000010616
000000000000000000000000010001557
011000000000000000000000000000555
000100000000000000000000000001551
001000000000000001000000000000537
000001000000000000000000100000473
010000000000000000000000000101471
000100000000000000000000000101457
000000000000000000000000100100451
011000000000000000000000000001448
011000000000000010000000000000439
001000000000000000000100000000421
000000000000000110000000000000402
000000000000000000000000001001392
001000000000000001000000000100386
001000000000000011000000000000370
000000010000000010000000000000336
000000000000000000100000000101332
011000000000000000000000000100320
001000000000000001000000000101319
000000000000000000000000011001287
000000000000000000100000000001264
011000000000000000000000000101258
010000000000000001000000000000253
001000000000000001000000000001232
000001000000000100000000000000230
010000000000000011000000000000216
000000000000000001000100000000215
010000000000000000100000000000184
000000010000000000000000000100176
001000000000000000100000000100174
010000000000000010100000000000172
001000000000000000100000000000170
010000000000000001000000000100170
010000000000000000100000000100164
010100000000000010000000000000160
001000000000000010100000000000156
000001010000000000000000000000150
000001000000000000000000000100145
000001000000000010000000000000134
001100000000000000000000000000133
000000010000000000000000000000132
001100000000000000000000000100124
010100000000000000000000000000123
000000000000000000010100000000117
000000000000000011100000000000117
000000010000100000000000000000116
000000000000000001100000000000112
010100000000000000000000000100112
000010000000001000000000000000107
001100000000000010000000000000106
000100000000000010100000000000103
00000000000000000000000001010197
00100000000000000010000000010196
00110000000000000000000000000196
00010000000000000100000000000095
00000001000000000000000001000094
00100000000000000000000001000094
01000000000011001000000000000094
00010000000000000010000000000090
00000000000000000000000000110189
00010000000000000010000000010088
00010000000000001100000000000088
00000000000000000110000000010083
00010000000000000100000000010083
00110000000000000000000000010183
00000000000000000010000000100082
00000000000010000100000000000080
00000001000001000000000000000080

Since my product fields were sorted, I know, for example, that 21,168 Customers had product 26 and 27 (the first row in the table above).

I'm not sure if this is the most efficient way of doing it though.

Super User
Posts: 10,044

Re: Finding Most Common Combination

Posted in reply to BecomingSASsy

Yeah. I agree with @Astounding . You only need to check the combination with two or three members.

If a combination of 2 has the highest frequency , then that has the highest frequency in all the combinations .

Xia Keshan

Super User
Posts: 5,516

Re: Finding Most Common Combination

Posted in reply to BecomingSASsy

One thing to keep in mind as you consider how to approach the problem:  you only need to find the two-product combinations.  The way you have posed the problem, if a three-product combination has the highest frequency, there will be three two-product combinations that have the same frequency.

Once you get past that hurdle, you could still use your approach of constructing a long string of 0s and 1s.  But instead of outputting that when you hit the end of a customer's records, you could increment counters for each two-level combination that is indicated by the long string.

Good luck.

Occasional Contributor
Posts: 9

Re: Finding Most Common Combination

Posted in reply to Astounding

Sorry, I'm not following your suggestion. Let's take a look at the frequency table again with string of 0s and 1s to indicate product combinations.

The flaw or the limitation of this approach is that it treats every single combination separately. Whereas I would want to combine the 2 rows in red. Row 4 tells us that 9,597 customers have products 28 and 30. Row 7 tells us that a 3,227 customers have products 3, 28 and 30. So if I wanted to know how many customers had products 28 and 30, I have to add these 2 rows. There may be other rows with same product combination e.g. there may be X number of customers who purchased product 1, 28 and 30 etc. etc.

PRODUCT_COMBOCUSTOMERS
00000000000000000000000001100021168
00000000000011000000000000000010613
00100000000000000000000000000110247
0000000000000000000000000001019597
0010000000000000000000000001005126
0010000000000000100000000000004805
0010000000000000000000000001013227
0000000000000000010000000001002838
0000001100000000000000000000002565
0000000000000000110000000000002500
0100000000000000100000000000002116
0000000000000000001000000001002085
0100000000000000000000000001001901
0000000000000000101000000000001789
0100000000000000000000000110001668
0001000000000000000000000001001659
0100000000000000000000000100001570
0001000000000000100000000000001548
0100000000000000000000000010001221
000000000000100010000000000000932
010000000000000000000000000001880
000000000000000000000000010100840
010000000000110000000000000000836
000000000000000001000000000001829
000000000000010010000000000000781
000000000000110010000000000000766
010000000000100000000000000000757
000000000000000000000000001100737
000000000000000001000000000101663
010000000000010000000000000000630
000000000000000000000000011100629
000010000000000000000000000010616
000000000000000000000000010001557
011000000000000000000000000000555
000100000000000000000000000001551
001000000000000001000000000000537
000001000000000000000000100000473
010000000000000000000000000101471
000100000000000000000000000101457
000000000000000000000000100100451
011000000000000000000000000001448
011000000000000010000000000000439
001000000000000000000100000000421
000000000000000110000000000000402
000000000000000000000000001001392
001000000000000001000000000100386
001000000000000011000000000000370
000000010000000010000000000000336
000000000000000000100000000101332
011000000000000000000000000100320
001000000000000001000000000101319
000000000000000000000000011001287
000000000000000000100000000001264
011000000000000000000000000101258
010000000000000001000000000000253
001000000000000001000000000001232
000001000000000100000000000000230
010000000000000011000000000000216
000000000000000001000100000000215
010000000000000000100000000000184
000000010000000000000000000100176
001000000000000000100000000100174
010000000000000010100000000000172
001000000000000000100000000000170
010000000000000001000000000100170
010000000000000000100000000100164
010100000000000010000000000000160
001000000000000010100000000000156
000001010000000000000000000000150
000001000000000000000000000100145
000001000000000010000000000000134
001100000000000000000000000000133
000000010000000000000000000000132
001100000000000000000000000100124
010100000000000000000000000000123
000000000000000000010100000000117
000000000000000011100000000000117
000000010000100000000000000000116
000000000000000001100000000000112
010100000000000000000000000100112
000010000000001000000000000000107
001100000000000010000000000000106
000100000000000010100000000000103
00000000000000000000000001010197
00100000000000000010000000010196
00110000000000000000000000000196
00010000000000000100000000000095
00000001000000000000000001000094
00100000000000000000000001000094
01000000000011001000000000000094
00010000000000000010000000000090
00000000000000000000000000110189
00010000000000000010000000010088
00010000000000001100000000000088
00000000000000000110000000010083
00010000000000000100000000010083
00110000000000000000000000010183
00000000000000000010000000100082
00000000000010000100000000000080
000000010000010000000000000000

80

Super User
Posts: 5,516

Re: Finding Most Common Combination

Posted in reply to BecomingSASsy

Just to give you one example, suppose products 3, 28, and 30 have the highest 3-way frequency.  Then the 2-way combination of products 3 and 28 must have at least as high a frequency (same for the combination of products 3 and 30, and products 28 and 30).

Ask a Question
Discussion stats
  • 9 replies
  • 821 views
  • 0 likes
  • 6 in conversation