Hi,
I'm working with a sas dataset containing 72 Binary Variables (indicating occurence/non - occurence of an item in the transaction) and 1 ID variable similar to the following setup:
ID | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 | Item 8 | Item 9 | Item 10 |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
I want to use PROC FREQ or PROC SQL to produce all possible 3, 4 and 5 way frequencies of these binary variables. I want the outputs is tables respectively designed for 3, 4 and 5 way possibilities as follows:
Variable 1 | Variable 2 | Variable 3 | Count |
---|---|---|---|
Item 1 | Item 2 | Item 3 | 3 |
Item 1 | Item 5 | Item 6 | 8 |
I tried producing this using sas looping macros with a mix of PROC FREQ, DATA STEP and PROC APPEND. I'm unable to produce the code here due to confidentiality reasons. I keep getting an out of memory error (in my Unix Server based SAS EG with a Windows Frontend) beyond a certain point of time which typically runs for several hours (only to discover the error). I was able to get this done for 2 way frequencies beyond which I have been without any luck.
My end goal is to produce a market basket analysis. Please let me know if there is a more efficient alternative to get this done. Thanks in advance for your help.
Thanks,
Prakash
It's no wonder that you ran out of memory, you are looking at about 15 million rows of output. 60K for the 3 way, 1+ million for the 4 way and 13+ million for the 5 way combination. (Google <72 chose 3> to get the exact count for 3-way).
Rather than this brute-force approach, see what others have done first. It's not my field, but I did find a lot just by Googling
market basket analysis site:sas.com
.
Doc Muhlbaier
Duke
Hi Doc,
Thanks for your response. I did realize up front that the number of combinations that can be possible is pretty high. I actually had put in a control within the code where I filter for and add only those frequencies with a compulsary positive value i.e. something like this
proc freq data = dataset;
tables Item1*Item2*Item3/out = freqset;
where Item1 = 1 and Item2 = 1 and Item3 = 1;
run;
The Item1, Item2 and Item3 will be controlled using a loop to ensure there are no repeats and scenarios where Item1 = Item2.
From your comments above I think 3 way should have worked because 60K rows should have been handled by SAS, but it looks like the brute force method is too intensive for SAS. I do agree that 1+ million and 13 million was always going to be iffy.
Thanks,
Prakash
hi ... here's one idea using PROC SUMMARY, the test data has 100 people and 10 binary variables
(but I agree with Doc ... it's a MUCH LARGER problem with many more variables)
it answers "3 at a time" and you can modify for 4 and/or 5
* test data;
data x;
array x(10);
do id = 1 to 100;
do _n_ = 1 to 10;
x(_n_) = (ranuni(999) gt .5);
end;
output;
end;
run;
* use summary with no var statement;
proc summary data=x ;
class x1-x10;
output out=counts;
run;
* look for observations with only 3 occurrences of 1;
data group3 (keep=v: _freq_);
array x(10);
array v(3) $32;
set counts;
if n(of x1-x10) eq 3 and sum(of x1-x10,0) eq 3;
loc = find(catt(of x1-x10),'1');
v1 = vname(x(loc));
loc = find(catt(of x1-x10),'1',loc+1);
v2 = vname(x(loc));
loc = find(catt(of x1-x10),'1',loc+1);
v3 = vname(x(loc));
run;
* change the order of the data set (x1 x2 x3 1st, x1 x2 x4 2nd, etc. );
proc sort data=group3 sortseq=linguistic(numeric_collation=on);
by v1 v2 v3;
run;
a portion of the output ...
v1 v2 v3 _FREQ_
x1 x2 x3 9
x1 x2 x4 10
x1 x2 x5 15
x1 x2 x6 15
x1 x2 x7 9
x1 x2 x8 8
x1 x2 x9 12
x1 x2 x10 13
x1 x3 x4 12
x1 x3 x5 18
x1 x3 x6 15
x1 x3 x7 14
x1 x3 x8 10
x1 x3 x9 11
x1 x3 x10 19
x1 x4 x5 16
x1 x4 x6 12
x1 x4 x7 14
x1 x4 x8 9
x1 x4 x9 12
x1 x4 x10 13
Hi MikeZDeb,
Let me try this approach first thing in the morning and let you know how it worked.
Thanks,
Prakash
WAYS
proc summary data=x chartype;
class x1-x10;
ways 3;
output out=counts;
run;
data counts;
set counts;
if n(of x:) eq sum(of x:);
run;
proc print;
run;
hi ... thanks, forgot about WAYS and gave up on trying to figure out how to use _TYPE_
it's too bad that WHERE does not support "OF" to make this a bit better ...
* only use observations with at least three occurrences of 1;
proc summary data=x (where=(sum (of x:) ge 3));
class x1-x10;
ways 3;
output out=counts;
run;
so why doesn't WHERE support "OF" ... I resorted to writing this to use when you'd like to use "OF" with WHERE ...
%macro whereof(func,var,n,also);
where &also &func(
%do j=1 %to %eval(&n-1);
&var&j ,
%end;
&var&j )
%mend;
proc summary data=x ;
%whereof(sum,x,10) ge 3;
class x1-x10;
ways 3;
output out=counts;
run;
Hard Luck with this approach as well.:smileyconfused: I'm encountering the same lengthy runs followed by the out of memory error. I think this is a SAS environment issue versus the aproach we are taking. I'm going to talk to my admin to see if something can be done.
Thanks,
Prakash
TOO many combinations....
369 data _null_;
370 do way=3,4,5;
371 x=comb(72,way);
372 put x=;
373 end;
374 run;
x=59640
x=1028790
x=13991544
Prakash,
As you have seen, you may strain the resources of your hardware. Here is the way I typically try to work with this sort of data to lessen the burden.
data subset;
set choices;
n_items = sum(of item_1 - item_72);
if (3 <= n_items <= 5);
length presented $ 14;
array items {72} item_1 - item_72;
position = 1;
do _i_=1 to 72;
if items{_i_}=1 then do;
substr(presented, position, 2) = put(_i_, z2.);
position = position + 3;
end;
end;
drop position _i_;
run;
By stringing the selections into one variable (presented), you get a lot of flexibility. For example:
proc freq data=subset;
where n_items=3; /* or 4 or 5 */
tables presented;
run;
Good luck.
Hi Astounding,
Thanks for your response. I'm currently identifying certain specific products and producing association rules using proc freq. Its worked well for me and has not put too much constraint on the system resources.
However, I plan to revisit the approach using your technique in the coming days as my end goal is to have a complete picture. I'll let you know how it works.
Thanks once again.
Prakash
Hi Astounding,
When I carefully reviewed your code i realized it doesn't produce what I exactly wanted. For Example, if you take a row with n_items = 5, it only produces the 5 items in the presented variables. However, there are 5C3 three way interactions and 5C4 4 way interations which do not get produced. Thats the whole idea of frequent item sets which we are trying to arrive at here.
Let us know your thoughts.
Thanks,
Prakash
Prakash,
I guess I'm misinterpreting the question. Maybe you can clarify.
Suppose there is a single incoming observation with n_items=5. There would be 5 5C4 combinations and 10 5C3 combinations. Are trying to count across all observations how many times each 5C3 combination appears?
That is correct. The goal is to count every possible 3 way/4 way and 5 way occurence i.e. in the above binary matrix all possibilities of 3 way, 4 way and 5 way frequencies of combinations of each item. Your idea was very interesting in terms of using the sum to determine the combinations. The way we should look to do it is to try and iteratively sum the binary variables within each row. So, to evaluate 3 way combinations you need to create 72C3 variables and scan for variables where we have sum = 3. Likewise for 4 and 5 way combinations.
To my knowledge this can be accomplished using macros but thats where I feel the root cause of our problems exist - Memory. I think the iterations are prohibitively high for my system.
Let us know your thoughts.
Thanks,
Prakash
Prakash,
This is feasible, if you shift your approach. Instead of creating many variables, create many observations instead. Here is the simplest case example, assuming that you have already created the data set SUBSET containing N_ITEMS and PRESENTED.
data _5C4_ (keep=_5C4_);
set subset (keep=n_items presented);
where n_items=5;
do i=1 to 13 by 3;
_5C4_ = presented;
substr(_5C4_, i, 2) = ' ';
_5C4_ = left(compbl(_5C4_));
output;
end;
run;
The idea is to calculate each possible 5C4 combination by blanking out one of the choices. Then standardize the format by getting rid of extra blanks. Finally, output the result as a separate observation.
You won't run out of memory creating the data set. At some point, however, you could run out of memory if you use PROC FREQ to do the counting. You may have to switch to brute force: SORT, then count in a DATA step.
The processing for 5C3 combinations is a little fancier, but follows the same general idea.
data _5C3_ (keep=_5C3_);
set subset (keep=n_items presented);
where n_items=5;
do i=1 to 10 by 3;
do j=i+3 to 13 by 3;
_5C3_ = presented;
substr(_5C3_, i, 2) = ' ';
substr(_5C3_, j, 2) = ' ';
_5C3_ = left(compbl(_5C3_));
output;
end;
end;
run;
This is untested code, so you may need to tweak it. If you get ambitious, you could manage to use one data step instead of two. But once you switch to creating multiple observations, you will trade your memory issues for storage space consumption. Also note that if you are starting with choice sets of more than 5 items, you will need to increase the length of PRESENTED accordingly.
Good luck.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.