turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Procedures
- /
- Producing all possible 3 way, 4 way and 5 way freq...

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

01-23-2012 05:03 AM

Hi,

I'm working with a sas dataset containing 72 Binary Variables (indicating occurence/non - occurence of an item in the transaction) and 1 ID variable similar to the following setup:

ID | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 | Item 8 | Item 9 | Item 10 |
---|---|---|---|---|---|---|---|---|---|---|

1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |

I want to use PROC FREQ or PROC SQL to produce all possible 3, 4 and 5 way frequencies of these binary variables. I want the outputs is tables respectively designed for 3, 4 and 5 way possibilities as follows:

Variable 1 | Variable 2 | Variable 3 | Count |
---|---|---|---|

Item 1 | Item 2 | Item 3 | 3 |

Item 1 | Item 5 | Item 6 | 8 |

I tried producing this using sas looping macros with a mix of PROC FREQ, DATA STEP and PROC APPEND. I'm unable to produce the code here due to confidentiality reasons. I keep getting an out of memory error (in my Unix Server based SAS EG with a Windows Frontend) beyond a certain point of time which typically runs for several hours (only to discover the error). I was able to get this done for 2 way frequencies beyond which I have been without any luck.

My end goal is to produce a market basket analysis. Please let me know if there is a more efficient alternative to get this done. Thanks in advance for your help.

Thanks,

Prakash

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PrakashSridharan

01-23-2012 10:16 AM

It's no wonder that you ran out of memory, you are looking at about 15 million rows of output. 60K for the 3 way, 1+ million for the 4 way and 13+ million for the 5 way combination. (Google <72 chose 3> to get the exact count for 3-way).

Rather than this brute-force approach, see what others have done first. It's not my field, but I did find a lot just by Googling

market basket analysis site:sas.com

.

Doc Muhlbaier

Duke

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Doc_Duke

01-23-2012 11:16 AM

Hi Doc,

Thanks for your response. I did realize up front that the number of combinations that can be possible is pretty high. I actually had put in a control within the code where I filter for and add only those frequencies with a compulsary positive value i.e. something like this

proc freq data = dataset;

tables Item1*Item2*Item3/out = freqset;

where Item1 = 1 and Item2 = 1 and Item3 = 1;

run;

The Item1, Item2 and Item3 will be controlled using a loop to ensure there are no repeats and scenarios where Item1 = Item2.

From your comments above I think 3 way should have worked because 60K rows should have been handled by SAS, but it looks like the brute force method is too intensive for SAS. I do agree that 1+ million and 13 million was always going to be iffy.

Thanks,

Prakash

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PrakashSridharan

01-23-2012 10:53 AM

hi ... here's one idea using PROC SUMMARY, the test data has 100 people and 10 binary variables

(but I agree with Doc ... it's a MUCH LARGER problem with many more variables)

it answers "3 at a time" and you can modify for 4 and/or 5

*** test data;**

**data x;**

**array x(10);**

**do id = 1 to 100;**

**do _n_ = 1 to 10;**

** x(_n_) = (ranuni(999) gt .5);**

**end;**

**output;**

**end;**

**run;**

*** use summary with no var statement;**

**proc summary data=x ;**

**class x1-x10;**

**output out=counts;**

**run;**

*** look for observations with only 3 occurrences of 1;**

**data group3 (keep=v: _freq_);**

**array x(10);**

**array v(3) $32;**

**set counts;**

**if n(of x1-x10) eq 3 and sum(of x1-x10,0) eq 3;**

**loc = find(catt(of x1-x10),'1');**

**v1 = vname(x(loc));**

**loc = find(catt(of x1-x10),'1',loc+1);**

**v2 = vname(x(loc));**

**loc = find(catt(of x1-x10),'1',loc+1);**

**v3 = vname(x(loc));**

**run;**

*** change the order of the data set (x1 x2 x3 1st, x1 x2 x4 2nd, etc. );**

**proc sort data=group3 sortseq=linguistic(numeric_collation=on);**

**by v1 v2 v3;**

**run;**

a portion of the output ...

**v1 v2 v3 _FREQ_**

**x1 x2 x3 9**

**x1 x2 x4 10**

**x1 x2 x5 15**

**x1 x2 x6 15**

**x1 x2 x7 9**

**x1 x2 x8 8**

**x1 x2 x9 12**

**x1 x2 x10 13**

**x1 x3 x4 12**

**x1 x3 x5 18**

**x1 x3 x6 15**

**x1 x3 x7 14**

**x1 x3 x8 10**

**x1 x3 x9 11**

**x1 x3 x10 19**

**x1 x4 x5 16**

**x1 x4 x6 12**

**x1 x4 x7 14**

**x1 x4 x8 9**

**x1 x4 x9 12**

**x1 x4 x10 13**

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to MikeZdeb

01-23-2012 11:18 AM

Hi MikeZDeb,

Let me try this approach first thing in the morning and let you know how it worked.

Thanks,

Prakash

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to MikeZdeb

01-23-2012 11:27 AM

WAYS

procsummarydata=x chartype;class x1-x10;

ways

3;output out=counts;

run;

datacounts;set counts;

if n(of x:) eq sum(of x:);

run;

proc

run;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to data_null__

01-23-2012 11:50 AM

hi ... thanks, forgot about WAYS and gave up on trying to figure out how to use _TYPE_

it's too bad that WHERE does not support "OF" to make this a bit better ...

*** only use observations with at least three occurrences of 1;**

**proc summary data=x (where=(sum (of x ge 3));**

**class x1-x10;**

`ways 3;`

**output out=counts;**

**run;**

so why doesn't WHERE support "OF" ... I resorted to writing this to use when you'd like to use "OF" with WHERE ...

**%macro whereof(func,var,n,also);**

**where &also &func(**

**%do j=1 %to %eval(&n-1);**

** &var&j ,**

**%end;**

**&var&j )**

**%mend;**

**proc summary data=x ;**

**%whereof(sum,x,10) ge 3;**

**class x1-x10;**

`ways 3;`

**output out=counts;**

**run;**

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to MikeZdeb

01-24-2012 12:06 AM

Hard Luck with this approach as well.:smileyconfused: I'm encountering the same lengthy runs followed by the out of memory error. I think this is a SAS environment issue versus the aproach we are taking. I'm going to talk to my admin to see if something can be done.

Thanks,

Prakash

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PrakashSridharan

01-24-2012 07:32 AM

TOO many combinations....

369 data _null_;

370 do way=3,4,5;

371 x=comb(72,way);

372 put x=;

373 end;

374 run;

x=59640

x=1028790

x=13991544

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PrakashSridharan

01-27-2012 07:44 PM

Prakash,

As you have seen, you may strain the resources of your hardware. Here is the way I typically try to work with this sort of data to lessen the burden.

data subset;

set choices;

n_items = sum(of item_1 - item_72);

if (3 <= n_items <= 5);

length presented $ 14;

array items {72} item_1 - item_72;

position = 1;

do _i_=1 to 72;

if items{_i_}=1 then do;

substr(presented, position, 2) = put(_i_, z2.);

position = position + 3;

end;

end;

drop position _i_;

run;

By stringing the selections into one variable (presented), you get a lot of flexibility. For example:

proc freq data=subset;

where n_items=3; /* or 4 or 5 */

tables presented;

run;

Good luck.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Astounding

01-29-2012 11:46 PM

Hi Astounding,

Thanks for your response. I'm currently identifying certain specific products and producing association rules using proc freq. Its worked well for me and has not put too much constraint on the system resources.

However, I plan to revisit the approach using your technique in the coming days as my end goal is to have a complete picture. I'll let you know how it works.

Thanks once again.

Prakash

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Astounding

01-31-2012 03:47 AM

Hi Astounding,

When I carefully reviewed your code i realized it doesn't produce what I exactly wanted. For Example, if you take a row with n_items = 5, it only produces the 5 items in the presented variables. However, there are 5C3 three way interactions and 5C4 4 way interations which do not get produced. Thats the whole idea of frequent item sets which we are trying to arrive at here.

Let us know your thoughts.

Thanks,

Prakash

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PrakashSridharan

01-31-2012 09:10 AM

Prakash,

I guess I'm misinterpreting the question. Maybe you can clarify.

Suppose there is a single incoming observation with n_items=5. There would be 5 5C4 combinations and 10 5C3 combinations. Are trying to count across all observations how many times each 5C3 combination appears?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Astounding

01-31-2012 09:38 AM

That is correct. The goal is to count every possible 3 way/4 way and 5 way occurence i.e. in the above binary matrix all possibilities of 3 way, 4 way and 5 way frequencies of combinations of each item. Your idea was very interesting in terms of using the sum to determine the combinations. The way we should look to do it is to try and iteratively sum the binary variables within each row. So, to evaluate 3 way combinations you need to create 72C3 variables and scan for variables where we have sum = 3. Likewise for 4 and 5 way combinations.

To my knowledge this can be accomplished using macros but thats where I feel the root cause of our problems exist - Memory. I think the iterations are prohibitively high for my system.

Let us know your thoughts.

Thanks,

Prakash

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PrakashSridharan

01-31-2012 10:06 AM

Prakash,

This is feasible, if you shift your approach. Instead of creating many variables, create many observations instead. Here is the simplest case example, assuming that you have already created the data set SUBSET containing N_ITEMS and PRESENTED.

data _5C4_ (keep=_5C4_);

set subset (keep=n_items presented);

where n_items=5;

do i=1 to 13 by 3;

_5C4_ = presented;

substr(_5C4_, i, 2) = ' ';

_5C4_ = left(compbl(_5C4_));

output;

end;

run;

The idea is to calculate each possible 5C4 combination by blanking out one of the choices. Then standardize the format by getting rid of extra blanks. Finally, output the result as a separate observation.

You won't run out of memory creating the data set. At some point, however, you could run out of memory if you use PROC FREQ to do the counting. You may have to switch to brute force: SORT, then count in a DATA step.

The processing for 5C3 combinations is a little fancier, but follows the same general idea.

data _5C3_ (keep=_5C3_);

set subset (keep=n_items presented);

where n_items=5;

do i=1 to 10 by 3;

do j=i+3 to 13 by 3;

_5C3_ = presented;

substr(_5C3_, i, 2) = ' ';

substr(_5C3_, j, 2) = ' ';

_5C3_ = left(compbl(_5C3_));

output;

end;

end;

run;

This is untested code, so you may need to tweak it. If you get ambitious, you could manage to use one data step instead of two. But once you switch to creating multiple observations, you will trade your memory issues for storage space consumption. Also note that if you are starting with choice sets of more than 5 items, you will need to increase the length of PRESENTED accordingly.

Good luck.