Calcite | Level 5

## Finding Most Common Combination

Greetings,

First post on SAS Communities!

Have a dataset with Customer ID, PURCHASE DATE and PRODUCT CODE. A customer can have 1 or many products, purchased over any time period. I am trying to figure out the most popular (or most common) combination of products purchased by customers. It doesn't necessarily have to be a combination of 2 products. Can be 2 or 3 or 4 or any number of products based on frequency.

For example, customer 416002285459214 purchased products PROD17 and PROD18. Same is true for customer 416039285459214 but this customer also purchased PROD02.

I'd want to show total frequency of customers who purchased PROD17 and PROD18 and then separately show frequency of customers who purchased PROD17, PROD18 and PROD02.

In the end, I'd want to know the top 5 or top 10 combinations whether they're 2-product or 3-product or 4-product combinations etc. Hope I've clearly explained the problem at hand and definitely hope the sample data would help.

BTW, my actual data has millions of customers with hundreds of products.

Thanks everyone.

Would've liked to attach a file, but surprisingly the attachments are limited to 1K. I'm not sure how much data one can get stored on a 1K file.

Customer_ID PURCHASE_DATE PRODUCT_CODE

4160012851159850 04DEC2014 PROD14

4160012851159850 18OCT2014 PROD17

416002285459214 18OCT2014 PROD17

416002285459214 11JAN2015 PROD18

416003285759484 20DEC2014 PROD13

416003285759484 11JAN2015 PROD14

4160042851059757 22FEB2015 PROD02

4160042851059757 18OCT2014 PROD26

4160042851059757 18MAY2015 PROD27

4160053041360036 18MAY2015 PROD02

4160053041360036 20OCT2014 PROD17

4160062851159850 04DEC2014 PROD04

4160062851159850 22FEB2015 PROD28

4160062851159850 03APR2014 PROD30

416007266559304 20OCT2014 PROD17

416007266559304 04DEC2014 PROD18

416008304959670 20OCT2014 PROD26

416008304959670 20OCT2014 PROD27

416009285859574 12SEP2014 PROD13

416009285859574 30APR2015 PROD14

416009285859574 03APR2014 PROD17

416010285859574 12SEP2014 PROD26

416010285859574 12SEP2014 PROD27

416011247659385 29JUN2014 PROD26

416011247659385 03APR2014 PROD27

4160123041360036 18MAY2015 PROD03

4160123041360036 20DEC2014 PROD28

4160123041360036 29JUN2014 PROD30

4160132661460123 30APR2015 PROD03

4160132661460123 29JUN2014 PROD28

416014285459214 18OCT2014 PROD26

416014285459214 18OCT2014 PROD27

416015304959670 20OCT2014 PROD25

416015304959670 12SEP2014 PROD28

416016304959670 20OCT2014 PROD06

416016304959670 03APR2014 PROD25

4160172851159850 04DEC2014 PROD25

4160172851159850 03APR2014 PROD26

4160302851159850 20OCT2014 PROD28

4160302851159850 18OCT2014 PROD30

416031266559304 20OCT2014 PROD28

416031266559304 03APR2014 PROD30

416032285459214 18OCT2014 PROD02

416032285459214 20OCT2014 PROD30

4160332851159850 04DEC2014 PROD03

4160332851159850 18OCT2014 PROD30

4160343041360036 18MAY2015 PROD03

4160343041360036 20DEC2014 PROD30

4160352661460123 30APR2015 PROD19

4160352661460123 03APR2014 PROD28

416036285459214 18OCT2014 PROD26

416036285459214 12SEP2014 PROD27

4160373041559124 03APR2014 PROD02

4160373041559124 20DEC2014 PROD13

416038285759484 20DEC2014 PROD13

416038285759484 03APR2014 PROD14

416038285759484 22FEB2015 PROD17

416039285459214 18OCT2014 PROD02

416039285459214 03APR2014 PROD17

416039285459214 11JAN2015 PROD18

4160403041559124 03APR2014 PROD13

4160403041559124 18OCT2014 PROD17

4160403041559124 11JAN2015 PROD19

4160412851059757 22FEB2015 PROD08

4160412851059757 18OCT2014 PROD28

4160422851159850 04DEC2014 PROD03

4160422851159850 03APR2014 PROD28

4160432851159850 04DEC2014 PROD03

4160432851159850 22FEB2015 PROD17

4160442851059757 22FEB2015 PROD26

4160442851059757 18OCT2014 PROD27

4160453041259946 11JAN2015 PROD19

4160453041259946 03APR2014 PROD28

4160463041360036 18MAY2015 PROD04

4160463041360036 03APR2014 PROD28

416047285459214 18OCT2014 PROD17

416047285459214 11JAN2015 PROD18

4160483041360036 18MAY2015 PROD04

4160483041360036 12SEP2014 PROD28

4160492851159850 04DEC2014 PROD03

4160492851159850 18OCT2014 PROD28

4160492851159850 03APR2014 PROD30

416050247659385 29JUN2014 PROD13

9 REPLIES 9
Super User

## Re: Finding Most Common Combination

This type of analysis is called Market Basket Analysis (MBA) - if you search there are several solutions on here. LexJansen.com also has a paper which includes a macro for MBA in Base SAS

Do you have SAS Enterprise Miner? If in EM you can use the MBA node.

Calcite | Level 5

## Re: Finding Most Common Combination

Thank you Reeza.

Did find a SUGI document on MBA.

http://www2.sas.com/proceedings/sugi28/223-28.pdf

Rhodochrosite | Level 12

## Re: Finding Most Common Combination

Related, and maybe useful, Stanford University is offering a MOOC in September , "Mining Massive Datasets", by professors Jure Leskovec, Anand Rajaraman and Jeff Ullman.  (https://www.coursera.org/course/mmds) which is a survey of such methods.  They cover this topic in chapter 6 of their book, offered free to download by publisher, links at http://www.mmds.org/.

Barite | Level 11

## Re: Finding Most Common Combination

See if this helps:

data have;

infile cards;

informat Customer_ID \$20. PURCHASE_DATE date9. PRODUCT_CODE \$6.;

format Customer_ID \$20. PURCHASE_DATE date9. PRODUCT_CODE \$6.;

input Customer_ID PURCHASE_DATE PRODUCT_CODE;

cards;

4160012851159850 04DEC2014 PROD14

4160012851159850 18OCT2014 PROD17

416002285459214 18OCT2014 PROD17

416002285459214 11JAN2015 PROD18

416003285759484 20DEC2014 PROD13

416003285759484 11JAN2015 PROD14

4160042851059757 22FEB2015 PROD02

4160042851059757 18OCT2014 PROD26

4160042851059757 18MAY2015 PROD27

4160053041360036 18MAY2015 PROD02

4160053041360036 20OCT2014 PROD17

4160062851159850 04DEC2014 PROD04

4160062851159850 22FEB2015 PROD28

4160062851159850 03APR2014 PROD30

416007266559304 20OCT2014 PROD17

416007266559304 04DEC2014 PROD18

416008304959670 20OCT2014 PROD26

416008304959670 20OCT2014 PROD27

416009285859574 12SEP2014 PROD13

416009285859574 30APR2015 PROD14

416009285859574 03APR2014 PROD17

416010285859574 12SEP2014 PROD26

416010285859574 12SEP2014 PROD27

416011247659385 29JUN2014 PROD26

416011247659385 03APR2014 PROD27

4160123041360036 18MAY2015 PROD03

4160123041360036 20DEC2014 PROD28

4160123041360036 29JUN2014 PROD30

4160132661460123 30APR2015 PROD03

4160132661460123 29JUN2014 PROD28

416014285459214 18OCT2014 PROD26

416014285459214 18OCT2014 PROD27

416015304959670 20OCT2014 PROD25

416015304959670 12SEP2014 PROD28

416016304959670 20OCT2014 PROD06

416016304959670 03APR2014 PROD25

4160172851159850 04DEC2014 PROD25

4160172851159850 03APR2014 PROD26

4160302851159850 20OCT2014 PROD28

4160302851159850 18OCT2014 PROD30

416031266559304 20OCT2014 PROD28

416031266559304 03APR2014 PROD30

416032285459214 18OCT2014 PROD02

416032285459214 20OCT2014 PROD30

4160332851159850 04DEC2014 PROD03

4160332851159850 18OCT2014 PROD30

4160343041360036 18MAY2015 PROD03

4160343041360036 20DEC2014 PROD30

4160352661460123 30APR2015 PROD19

4160352661460123 03APR2014 PROD28

416036285459214 18OCT2014 PROD26

416036285459214 12SEP2014 PROD27

4160373041559124 03APR2014 PROD02

4160373041559124 20DEC2014 PROD13

416038285759484 20DEC2014 PROD13

416038285759484 03APR2014 PROD14

416038285759484 22FEB2015 PROD17

416039285459214 18OCT2014 PROD02

416039285459214 03APR2014 PROD17

416039285459214 11JAN2015 PROD18

4160403041559124 03APR2014 PROD13

4160403041559124 18OCT2014 PROD17

4160403041559124 11JAN2015 PROD19

4160412851059757 22FEB2015 PROD08

4160412851059757 18OCT2014 PROD28

4160422851159850 04DEC2014 PROD03

4160422851159850 03APR2014 PROD28

4160432851159850 04DEC2014 PROD03

4160432851159850 22FEB2015 PROD17

4160442851059757 22FEB2015 PROD26

4160442851059757 18OCT2014 PROD27

4160453041259946 11JAN2015 PROD19

4160453041259946 03APR2014 PROD28

4160463041360036 18MAY2015 PROD04

4160463041360036 03APR2014 PROD28

416047285459214 18OCT2014 PROD17

416047285459214 11JAN2015 PROD18

4160483041360036 18MAY2015 PROD04

4160483041360036 12SEP2014 PROD28

4160492851159850 04DEC2014 PROD03

4160492851159850 18OCT2014 PROD28

4160492851159850 03APR2014 PROD30

416050247659385 29JUN2014 PROD13

;

proc sort data=have;by Customer_ID product_code;

proc transpose data=prep out=tran_prep (drop=_NAME_);by customer_id;var product_code;

proc sql;

create table want as

select distinct catx('-',col1,col2,col3) as Products,count(calculated products) as count

from tran_prep

group by products

order by count desc;

Calcite | Level 5

## Re: Finding Most Common Combination

Thank you for the reply Mark Johnson.

For some reason, my SAS froze when I tried running the transpose procedure ( I did change the prep to have).

Having said that, I have already tried a similar approach. Instead of using var product_code, I transposed by id product_code and inserted a 1 where a customer had a product. So essentially, my dataset contained a column for Customer_ID followed by 30 columns for products, which had a 1 every time a customer had that product. I have more than 30 products but limited it to 30 for this method.

Then I concatenated all the product fields (after changing the missing values to 0) and ended up with a bunch of different combinations like this.

 PRODUCT_COMBO CUSTOMERS 000000000000000000000000011000 21168 000000000000110000000000000000 10613 001000000000000000000000000001 10247 000000000000000000000000000101 9597 001000000000000000000000000100 5126 001000000000000010000000000000 4805 001000000000000000000000000101 3227 000000000000000001000000000100 2838 000000110000000000000000000000 2565 000000000000000011000000000000 2500 010000000000000010000000000000 2116 000000000000000000100000000100 2085 010000000000000000000000000100 1901 000000000000000010100000000000 1789 010000000000000000000000011000 1668 000100000000000000000000000100 1659 010000000000000000000000010000 1570 000100000000000010000000000000 1548 010000000000000000000000001000 1221 000000000000100010000000000000 932 010000000000000000000000000001 880 000000000000000000000000010100 840 010000000000110000000000000000 836 000000000000000001000000000001 829 000000000000010010000000000000 781 000000000000110010000000000000 766 010000000000100000000000000000 757 000000000000000000000000001100 737 000000000000000001000000000101 663 010000000000010000000000000000 630 000000000000000000000000011100 629 000010000000000000000000000010 616 000000000000000000000000010001 557 011000000000000000000000000000 555 000100000000000000000000000001 551 001000000000000001000000000000 537 000001000000000000000000100000 473 010000000000000000000000000101 471 000100000000000000000000000101 457 000000000000000000000000100100 451 011000000000000000000000000001 448 011000000000000010000000000000 439 001000000000000000000100000000 421 000000000000000110000000000000 402 000000000000000000000000001001 392 001000000000000001000000000100 386 001000000000000011000000000000 370 000000010000000010000000000000 336 000000000000000000100000000101 332 011000000000000000000000000100 320 001000000000000001000000000101 319 000000000000000000000000011001 287 000000000000000000100000000001 264 011000000000000000000000000101 258 010000000000000001000000000000 253 001000000000000001000000000001 232 000001000000000100000000000000 230 010000000000000011000000000000 216 000000000000000001000100000000 215 010000000000000000100000000000 184 000000010000000000000000000100 176 001000000000000000100000000100 174 010000000000000010100000000000 172 001000000000000000100000000000 170 010000000000000001000000000100 170 010000000000000000100000000100 164 010100000000000010000000000000 160 001000000000000010100000000000 156 000001010000000000000000000000 150 000001000000000000000000000100 145 000001000000000010000000000000 134 001100000000000000000000000000 133 000000010000000000000000000000 132 001100000000000000000000000100 124 010100000000000000000000000000 123 000000000000000000010100000000 117 000000000000000011100000000000 117 000000010000100000000000000000 116 000000000000000001100000000000 112 010100000000000000000000000100 112 000010000000001000000000000000 107 001100000000000010000000000000 106 000100000000000010100000000000 103 000000000000000000000000010101 97 001000000000000000100000000101 96 001100000000000000000000000001 96 000100000000000001000000000000 95 000000010000000000000000010000 94 001000000000000000000000010000 94 010000000000110010000000000000 94 000100000000000000100000000000 90 000000000000000000000000001101 89 000100000000000000100000000100 88 000100000000000011000000000000 88 000000000000000001100000000100 83 000100000000000001000000000100 83 001100000000000000000000000101 83 000000000000000000100000001000 82 000000000000100001000000000000 80 000000010000010000000000000000 80

Since my product fields were sorted, I know, for example, that 21,168 Customers had product 26 and 27 (the first row in the table above).

I'm not sure if this is the most efficient way of doing it though.

Super User

## Re: Finding Most Common Combination

Yeah. I agree with @Astounding . You only need to check the combination with two or three members.

If a combination of 2 has the highest frequency , then that has the highest frequency in all the combinations .

Xia Keshan

PROC Star

## Re: Finding Most Common Combination

One thing to keep in mind as you consider how to approach the problem:  you only need to find the two-product combinations.  The way you have posed the problem, if a three-product combination has the highest frequency, there will be three two-product combinations that have the same frequency.

Once you get past that hurdle, you could still use your approach of constructing a long string of 0s and 1s.  But instead of outputting that when you hit the end of a customer's records, you could increment counters for each two-level combination that is indicated by the long string.

Good luck.

Calcite | Level 5

## Re: Finding Most Common Combination

Sorry, I'm not following your suggestion. Let's take a look at the frequency table again with string of 0s and 1s to indicate product combinations.

The flaw or the limitation of this approach is that it treats every single combination separately. Whereas I would want to combine the 2 rows in red. Row 4 tells us that 9,597 customers have products 28 and 30. Row 7 tells us that a 3,227 customers have products 3, 28 and 30. So if I wanted to know how many customers had products 28 and 30, I have to add these 2 rows. There may be other rows with same product combination e.g. there may be X number of customers who purchased product 1, 28 and 30 etc. etc.

 PRODUCT_COMBO CUSTOMERS 000000000000000000000000011000 21168 000000000000110000000000000000 10613 001000000000000000000000000001 10247 000000000000000000000000000101 9597 001000000000000000000000000100 5126 001000000000000010000000000000 4805 001000000000000000000000000101 3227 000000000000000001000000000100 2838 000000110000000000000000000000 2565 000000000000000011000000000000 2500 010000000000000010000000000000 2116 000000000000000000100000000100 2085 010000000000000000000000000100 1901 000000000000000010100000000000 1789 010000000000000000000000011000 1668 000100000000000000000000000100 1659 010000000000000000000000010000 1570 000100000000000010000000000000 1548 010000000000000000000000001000 1221 000000000000100010000000000000 932 010000000000000000000000000001 880 000000000000000000000000010100 840 010000000000110000000000000000 836 000000000000000001000000000001 829 000000000000010010000000000000 781 000000000000110010000000000000 766 010000000000100000000000000000 757 000000000000000000000000001100 737 000000000000000001000000000101 663 010000000000010000000000000000 630 000000000000000000000000011100 629 000010000000000000000000000010 616 000000000000000000000000010001 557 011000000000000000000000000000 555 000100000000000000000000000001 551 001000000000000001000000000000 537 000001000000000000000000100000 473 010000000000000000000000000101 471 000100000000000000000000000101 457 000000000000000000000000100100 451 011000000000000000000000000001 448 011000000000000010000000000000 439 001000000000000000000100000000 421 000000000000000110000000000000 402 000000000000000000000000001001 392 001000000000000001000000000100 386 001000000000000011000000000000 370 000000010000000010000000000000 336 000000000000000000100000000101 332 011000000000000000000000000100 320 001000000000000001000000000101 319 000000000000000000000000011001 287 000000000000000000100000000001 264 011000000000000000000000000101 258 010000000000000001000000000000 253 001000000000000001000000000001 232 000001000000000100000000000000 230 010000000000000011000000000000 216 000000000000000001000100000000 215 010000000000000000100000000000 184 000000010000000000000000000100 176 001000000000000000100000000100 174 010000000000000010100000000000 172 001000000000000000100000000000 170 010000000000000001000000000100 170 010000000000000000100000000100 164 010100000000000010000000000000 160 001000000000000010100000000000 156 000001010000000000000000000000 150 000001000000000000000000000100 145 000001000000000010000000000000 134 001100000000000000000000000000 133 000000010000000000000000000000 132 001100000000000000000000000100 124 010100000000000000000000000000 123 000000000000000000010100000000 117 000000000000000011100000000000 117 000000010000100000000000000000 116 000000000000000001100000000000 112 010100000000000000000000000100 112 000010000000001000000000000000 107 001100000000000010000000000000 106 000100000000000010100000000000 103 000000000000000000000000010101 97 001000000000000000100000000101 96 001100000000000000000000000001 96 000100000000000001000000000000 95 000000010000000000000000010000 94 001000000000000000000000010000 94 010000000000110010000000000000 94 000100000000000000100000000000 90 000000000000000000000000001101 89 000100000000000000100000000100 88 000100000000000011000000000000 88 000000000000000001100000000100 83 000100000000000001000000000100 83 001100000000000000000000000101 83 000000000000000000100000001000 82 000000000000100001000000000000 80 000000010000010000000000000000 80
PROC Star

## Re: Finding Most Common Combination

Just to give you one example, suppose products 3, 28, and 30 have the highest 3-way frequency.  Then the 2-way combination of products 3 and 28 must have at least as high a frequency (same for the combination of products 3 and 30, and products 28 and 30).

Discussion stats
• 9 replies
• 3109 views
• 0 likes
• 6 in conversation