BookmarkSubscribeRSS Feed
PrakashSridharan
Calcite | Level 5

Hi,

I'm working with a sas dataset containing 72 Binary Variables (indicating occurence/non - occurence of an item in the transaction) and 1 ID variable similar to the following setup:

IDItem 1Item 2Item 3Item 4Item 5Item 6Item 7Item 8Item 9Item 10
11000100000
20000110000

I want to use PROC FREQ or PROC SQL to produce all possible 3, 4 and 5 way frequencies of these binary variables. I want the outputs is tables respectively designed for 3, 4 and 5 way possibilities as follows:

Variable 1Variable 2Variable 3Count
Item 1Item 2Item 33
Item 1Item 5Item 68


I tried producing this using sas looping macros with a mix of PROC FREQ, DATA STEP and PROC APPEND. I'm unable to produce the code here due to confidentiality reasons. I keep getting an out of memory error (in my Unix Server based SAS EG with a Windows Frontend) beyond a certain point of time which typically runs for several hours (only to discover the error). I was able to get this done for 2 way frequencies beyond which I have been without any luck.

My end goal is to produce a market basket analysis. Please let me know if there is a more efficient alternative to get this done. Thanks in advance for your help.

Thanks,

Prakash

15 REPLIES 15
Doc_Duke
Rhodochrosite | Level 12

It's no wonder that you ran out of memory, you are looking at about 15 million rows of output.  60K for the 3 way, 1+ million for the 4 way and 13+ million for the 5 way combination. (Google <72 chose 3> to get the exact count for 3-way).

Rather than this brute-force approach, see what others have done first.  It's not my field, but I did find a lot just by Googling

market basket analysis site:sas.com

.

Doc Muhlbaier

Duke

PrakashSridharan
Calcite | Level 5

Hi Doc,

Thanks for your response. I did realize up front that the number of combinations that can be possible is pretty high. I actually had put in a control within the code where I filter for and add only those frequencies with a compulsary positive value i.e. something like this

     proc freq data = dataset;

          tables Item1*Item2*Item3/out = freqset;

               where Item1 = 1 and Item2 = 1 and Item3 = 1;

     run;

The Item1, Item2 and Item3 will be controlled using a loop to ensure there are no repeats and scenarios where Item1 = Item2.

From your comments above I think 3 way should have worked because 60K rows should have been handled by SAS, but it looks like the brute force method is too intensive for SAS. I do agree that 1+ million and 13 million was always going to be iffy.

Thanks,

Prakash

MikeZdeb
Rhodochrosite | Level 12

hi ... here's one idea using PROC SUMMARY, the test data has 100 people and 10 binary variables

(but I agree with Doc ... it's a MUCH LARGER problem with many more variables)

it answers "3 at a time" and you can modify for 4 and/or 5

* test data;

data x;

array x(10);

do id  = 1 to 100;

do _n_ = 1 to 10;

    x(_n_) = (ranuni(999) gt .5);

end;

output;

end;

run;

* use summary with no var statement;

proc summary data=x ;

class x1-x10;

output out=counts;

run;

* look for observations with only 3 occurrences of 1;

data group3 (keep=v: _freq_);

array x(10);

array v(3) $32;

set counts;

if n(of x1-x10) eq 3 and sum(of x1-x10,0) eq 3;

loc = find(catt(of x1-x10),'1');

v1 = vname(x(loc));

loc = find(catt(of x1-x10),'1',loc+1);

v2 = vname(x(loc));

loc = find(catt(of x1-x10),'1',loc+1);

v3 = vname(x(loc));

run;

* change the order of the data set (x1 x2 x3 1st, x1 x2 x4 2nd, etc. );

proc sort data=group3 sortseq=linguistic(numeric_collation=on);

by v1 v2 v3;

run;

a portion of the output ...

v1    v2    v3     _FREQ_


x1    x2    x3        9

x1    x2    x4       10

x1    x2    x5       15

x1    x2    x6       15

x1    x2    x7        9

x1    x2    x8        8

x1    x2    x9       12

x1    x2    x10      13

x1    x3    x4       12

x1    x3    x5       18

x1    x3    x6       15

x1    x3    x7       14

x1    x3    x8       10

x1    x3    x9       11

x1    x3    x10      19

x1    x4    x5       16

x1    x4    x6       12

x1    x4    x7       14

x1    x4    x8        9

x1    x4    x9       12

x1    x4    x10      13

PrakashSridharan
Calcite | Level 5

Hi MikeZDeb,

Let me try this approach first thing in the morning and let you know how it worked.

Thanks,

Prakash

data_null__
Jade | Level 19

WAYS

proc summary data=x chartype;

   class x1-x10;

   ways 3;

   output out=counts;

   run;

data counts;

   set counts;

   if n(of x:) eq sum(of x:);

   run;

proc print;

   run;

MikeZdeb
Rhodochrosite | Level 12

hi ... thanks, forgot about WAYS and gave up on trying to figure out how to use _TYPE_

it's too bad that WHERE does not support "OF" to make this a bit better ...

* only use observations with at least three occurrences of 1;

proc summary data=x (where=(sum (of x:) ge 3));

class x1-x10;

ways 3;

output out=counts;

run;

so why doesn't WHERE support "OF" ... I resorted to writing this to use when you'd like to use "OF" with WHERE ...

%macro whereof(func,var,n,also);

where &also &func(

%do j=1 %to %eval(&n-1);

   &var&j ,

%end;

&var&j )

%mend;

proc summary data=x ;

%whereof(sum,x,10) ge 3;

class x1-x10;

ways 3;

output out=counts;

run;

PrakashSridharan
Calcite | Level 5

Hard Luck with this approach as well.:smileyconfused: I'm encountering the same lengthy runs followed by the out of memory error. I think this is a SAS environment issue versus the aproach we are taking. I'm going to talk to my admin to see if something can be done.

Thanks,

Prakash

data_null__
Jade | Level 19

TOO many combinations....

369  data _null_;

370     do way=3,4,5;

371        x=comb(72,way);

372        put x=;

373        end;

374     run;

x=59640

x=1028790

x=13991544

Astounding
PROC Star

Prakash,

As you have seen, you may strain the resources of your hardware.  Here is the way I typically try to work with this sort of data to lessen the burden.

data subset;

set choices;

n_items = sum(of item_1 - item_72);

if (3 <= n_items <= 5);

length presented $ 14;

array items {72} item_1 - item_72;

position = 1;

do _i_=1 to 72;

   if items{_i_}=1 then do;

      substr(presented, position, 2) = put(_i_, z2.);

      position = position + 3;

   end;

end;

drop position _i_;

run;

By stringing the selections into one variable (presented), you get a lot of flexibility.  For example:

proc freq data=subset;

   where n_items=3; /* or 4 or 5 */

   tables presented;

run;

Good luck.

PrakashSridharan
Calcite | Level 5

Hi Astounding,

Thanks for your response. I'm currently identifying certain specific products and producing association rules using proc freq. Its worked well for me and has not put too much constraint on the system resources.

However, I plan to revisit the approach using your technique in the coming days as my end goal is to have a complete picture. I'll let you know how it works.

Thanks once again.

Prakash

PrakashSridharan
Calcite | Level 5

Hi Astounding,

When I carefully reviewed your code i realized it doesn't produce what I exactly wanted. For Example, if you take a row with n_items = 5, it only produces the 5 items in the presented variables. However, there are 5C3 three way interactions and 5C4 4 way interations which do not get produced. Thats the whole idea of frequent item sets which we are trying to arrive at here.

Let us know your thoughts.

Thanks,

Prakash  

Astounding
PROC Star

Prakash,

I guess I'm misinterpreting the question.  Maybe you can clarify.

Suppose there is a single incoming observation with n_items=5.  There would be 5 5C4 combinations and 10 5C3 combinations.   Are trying to count across all observations how many times each 5C3 combination appears?

PrakashSridharan
Calcite | Level 5

That is correct. The goal is to count every possible 3 way/4 way and 5 way occurence i.e. in the above binary matrix all possibilities of 3 way, 4 way and 5 way frequencies of combinations of each item. Your idea was very interesting in terms of using the sum to determine the combinations. The way we should look to do it is to try and iteratively sum the binary variables within each row. So, to evaluate 3 way combinations you need to create 72C3 variables and scan for variables where we have sum = 3. Likewise for 4 and 5 way combinations.

To my knowledge this can be accomplished using macros but thats where I feel the root cause of our problems exist - Memory. I think the iterations are prohibitively high for my system.

Let us know your thoughts.

Thanks,

Prakash

Astounding
PROC Star

Prakash,

This is feasible, if you shift your approach.  Instead of creating many variables, create many observations instead.  Here is the simplest case example, assuming that you have already created the data set SUBSET containing N_ITEMS and PRESENTED.

data _5C4_ (keep=_5C4_);

   set subset (keep=n_items presented);

   where n_items=5;

   do i=1 to 13 by 3;

      _5C4_ = presented;

      substr(_5C4_, i, 2) = '  ';

      _5C4_ = left(compbl(_5C4_));

      output;

   end;

run;

The idea is to calculate each possible 5C4 combination by blanking out one of the choices.  Then standardize the format by getting rid of extra blanks.  Finally, output the result as a separate observation.

You won't run out of memory creating the data set.  At some point, however, you could run out of memory if you use PROC FREQ to do the counting.  You may have to switch to brute force:  SORT, then count in a DATA step.

The processing for 5C3 combinations is a little fancier, but follows the same general idea.

data _5C3_ (keep=_5C3_);

   set subset (keep=n_items presented);

   where n_items=5;

   do i=1 to 10 by 3;

      do j=i+3 to 13 by 3;

         _5C3_ = presented;

         substr(_5C3_, i, 2) = '  ';

         substr(_5C3_, j, 2) = '  ';

         _5C3_ = left(compbl(_5C3_));

         output;

      end;

   end;

run;

This is untested code, so you may need to tweak it.  If you get ambitious, you could manage to use one data step instead of two.  But once you switch to creating multiple observations, you will trade your memory issues for storage space consumption.  Also note that if you are starting with choice sets of more than 5 items, you will need to increase the length of PRESENTED accordingly.

Good luck.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 1413 views
  • 0 likes
  • 5 in conversation