Re: which procedure can help me here?

deleted_user · Posted 06-29-2010 10:59 AM

Dear Forum,

I have data in the form of

ID date brand
1 1 1
1 1 2
1 1 3
1 2 1
2 1 1
2 1 2
2 2 1
2 2 1
3 1 1
3 2 2
3 3 1
4 1 1
4 2 2
4 2 1

Now, I want to know how many different brands were purchased on one date by one ID, and I want to include this information as a new variable in the dataset. How can I manage to do that? Does anyone know a proper procedure?

Regards
yel10orblu

deleted_user · Posted 06-29-2010 11:49 AM

one way do to is to use proc sql to create the summary.

deleted_user · Posted 06-29-2010 12:29 PM

could you probably provide me with a code example? because I'm kind of lost here...

Cynthia_sas · Posted 06-30-2010 11:55 AM

Hi:
Before I go down the SQL road or the DATA step/MERGE road, I like to see what I can do with the basic set of procedures for counting. Those "counting" procedures are: PROC FREQ, PROC TABULATE and PROC REPORT. Those are the 3 procedures I show in the program sample below. Each procedure shows 2 possibilities for getting either a report or a dataset (using OUT=). The first possibility shows just getting counts by ID and DATE and the second possibility for each procedure throws BRAND into the mix for counting.

Depending on your ultimate goal -- for example, if you were just creating a dataset of counts so you could do a PROC PRINT -- then you might not need to create an output data set at all. You might be happy with the report step output for a particular procedure. If you run the code and examine the output from each step, you will see that each procedure has its own "look and feel" for how the report is structured and how the output dataset looks.

(You can also get PROC MEANS to do counting for you -- but I was already at 6 code examples, so I didn't show a PROC MEANS example.)

cynthia
[pre]
data brand;
infile datalines;
input ID date brand;
return;
datalines;
1 1 1
1 1 2
1 1 3
1 2 1
2 1 1
2 1 2
2 2 1
2 2 1
3 1 1
3 2 2
3 3 1
4 1 1
4 2 2
4 2 1
;
run;

options nodate nonumber missing=0;

ods listing close;
ods html file='c:\temp\diff_approaches.html' style=sasweb;
proc freq data=brand;
title '1) Proc Freq Approach -- ID and DATE counts';
tables id*date / list out=work.frq_id1;
run;

proc print data=work.frq_id1;
title 'DATASET created by PROC FREQ 1';
run;

proc freq data=brand;
title '2) Proc Freq Approach -- ID, DATE and BRAND counts';
tables id*date*brand / list out=work.frq_id2;
run;

proc print data=work.frq_id2;
title 'DATASET created by PROC FREQ 2';
run;

proc tabulate data=brand f=comma6. out=work.tab_id1;
title '3) PROC TABULATE Approach 1 -- ID and DATE';
class id date;
table id, date all;
run;

proc print data=work.tab_id1;
title 'DATASET created by PROC TABULATE 1';
run;

proc tabulate data=brand f=comma6. out=work.tab_id2;
title '4) PROC TABULATE Approach 2 -- ID, DATE and BRAND';
class id date brand;
table id, date*brand*n all*brand*n all;
keylabel n=' ';
run;

proc print data=work.tab_id2;
title 'DATASET created by PROC TABULATE 2';
run;

proc report data=brand nowd out=work.rep_id1;
title '5) PROC REPORT Approach 1 -- ID and DATE';
column id date n;
define id / group;
define date / across;
define n / 'Total';
run;

proc print data=work.rep_id1;
title 'DATASET created by PROC REPORT 1';
run;

proc report data=brand nowd out=work.rep_id2 completerows;
title '6) PROC REPORT Approach 2 -- ID, DATE and BRAND';
column id date brand n;
define id / group;
define date / group;
define brand / across;
define n / 'Total';
run;

proc print data=work.rep_id2;
title 'DATASET created by PROC REPORT 2';
run;

ods html close;
title;

[/pre]

Peter_C · Posted 07-02-2010 08:50 AM

Hi Cynthia

now that tabulate can be used to create an output dataset, do you think proc means will go the way of summary and be subsumed within tabulate?
Or are there features of proc means that are not available from tabulate?

regards
peter

Cynthia_sas · Posted 07-02-2010 01:15 PM

Hi, Peter:
I don't have a crystal ball or a direct line to the developers, but I hope not. PROC MEANS/SUMMARY is good for "quick in/out make me the data, get me the report, I don't need it in a particular structure". I see PROC MEANS/SUMMARY as being the first step in the road to other reporting procedures.

I look at it like this: I can put a single egg (data) in my Cuisanart food processor (TABULATE) and whisk it that way, or I could just use my simple handheld whisk or egg beater that gets the job done too (MEANS/SUMMARY).

When you get to the point with PROC MEANS that you say, "Oh, this is wonderful, but I want the values of WHIRLYGIGS nested inside the values for GEWGAWS and I want a new page for every unique value of FRAMISTAT. And could I have the MIN, MEAN and MAX statistics underneath a column for each unique value of KNICKKNACKS. And I need a count of TOTAL WHIRLYGIGS by KNICKKNACKS as the final column." That's when I haul out the Cuisanart.

PROC MEANS has the TYPES and WAYS statements, which I love, because they simplify the way that you can get different combinations of crossings of the CLASS variables -- in a more shorthand format -- than having multiple TABLE statements with TABULATE. And, MEANS/SUMMARY has the ID statement to get other vars into the output dataset. Now that TABULATE can create an output dataset, I think that's totally cool and I have used the feature -- not as much as I just use PROC MEANS for OUTPUT datasets or even PROC REPORT for OUTPUT datasets. I guess my opinion is that PROC MEANS/SUMMARY combination made sense, since by the time they were rewritten to use/create output objects, they were essentially the same procedure.

I like having PROC MEANS and PROC TABULATE as 2 separate procedures. They are enough different that I can see MEANS appealing to one audience for its simplicity of use and TABULATE appealing to another audience for its power and elegant TABLE statement syntax. I know that TABULATE can potentially have a large number of crossings for the CLASS variable (2^31-1) and I'm not sure on whether MEANS has the same limit or not, but that might be one area of difference.

So, no hard and fast answers, just my .02...
cynthia

Peter_C · Posted 07-05-2010 08:54 AM

Cynthia
thank you for your thoughts.
appreciate the insights.
Until the "stats gathering engines" are made common between base SAS procedures (tabulate, means, freq, univariate) (if ever) it seems most unlikely that the extremes through "id-group-specification" of proc Means, would be available to Tabulate.
Given the features of the procs it seems satisfactory to consider proc Means as the provider of stats in table(s) and Proc Tabulate as the reporter of stats (along with Proc Report - such a mixture of detail and summary - like reporting "on steroids"). [and maybe someday I'll appreciate more, the difference that Proc Freq makes, too]

Regarding the handling of large cross-classings (2^31-1) : I bumped into that when testing run-time performance of Proc Tabulate. There is a performance issue for large cross-classing combinations in the original Tabulate "stats-gathering" engine (which is still there to guarantee legacy support in the Listing destination).
With the test table of 3 observations with 20 class vars each with unique values, generated in[pre]data test ;
input (c1-c20 )($1.) ;
cards ;
qwertyuiopasdfghjklz
wertyuiopasdfghjklz
ertyuiopasdfghjklz
;;;[/pre] obviously artificial, but easy to re-type 😉 [pre]proc tabulate data= test(obs=3) missing format= 5.1 ;
class c1-c20 ;
table c1*c2*c3*c4*c5*c6*c7*c8*c9*c10
* c11*c12*c13*c14*c15*c16*c17*c18*c19*c20
, n /rts=81 ;
run ;[/pre]With these 3 obs and 20 class variables the message occurs
ERROR: Limit of 2147483647 potential interactions within a crossing exceeded for the table statement at line nnn.
It is just because 3**20 is greater than 2**31-1. Fortunately, in the listing destination, I don't want such an analysis.[pre]proc means noprint missing nway ;
class c1-c20 ;
output ;
run ;[/pre]successfully summarises the combinations into a sas data set, well beyond these limits in Tabulate. For 20 class vars the run time was almost measureable at 0.01 seconds. 🙂
As 3^19 is less than 2^31, proc Tabulate will try analysing 3 obs with 19 class variables. The runtime is shown in this log[pre]316 proc tabulate data= test(obs=3) missing format= 5.1 ;
317 class c1-c20 ;
318 table c1*c2*c3*c4*c5*c6*c7*c8*c9*c10
319 * c11*c12*c13*c14*c15*c16*c17*c18*c19
320 , n /rts=81 ;
321 run ;

NOTE: There were 3 observations read from the data set WORK.TEST.
NOTE: PROCEDURE TABULATE used (Total process time):
real time 4:40.31 😞
user cpu time 4:39.14
system cpu time 0.09 seconds
Memory 14077k[/pre]Writing to html instead of listing, the runtime was very very much better [pre] real time 0.12 seconds 🙂 [/pre]
I still like(++) Tabulate, but I'll stick to PROC MEANS for collecting summary statistics (just in case listing is open),
thank you

peterC

Cynthia_sas · Posted 07-05-2010 10:58 AM

Thanks, Peter...
That was very interesting. And it finally helped explain a bit of TABULATE "superstition" that I've been carrying around with me since the 80s. When I first started using SAS and worked at the university computer center, there were only 1 or 2 people using TABULATE. And, as they were explaining it to me and how wonderful it was, they kept advising that I shouldn't try to nest more than 10 vars in a single dimension. That little bit of advice stuck with me and I never have nested more than 10 levels in a single dimension, so I never would have tried your test code.

But in fact, if I modify your code just a bit, to have 10 variables in the ROW dim and 10 variables in the COL dim I do NOT get the message in the log. I always figured they must have had some reason for the advice (and one of the folks had known the older, TPL language) -- but their answer was always "it's just better" or "takes less memory" or "that's the way I've always done it".

So, I wonder if that was their way of bypassing the 2^31-1 issue. Anyway, here's my log with all 20 vars -- sent to LISTING and HTML (had to increase PS for LISTING or else it would complain).

cynthia
[pre]
66 options ls=256 ps=500;
67 ods listing;
68 ods html file='c:\temp\peter_10.html' style=sasweb;
NOTE: Writing HTML Body file: c:\temp\peter_10.html
69 proc tabulate data= test(obs=3) missing format= 5.1 ;
70 class c1-c20 ;
71 table c1*c2*c3*c4*c5*c6*c7*c8*c9*c10,
72 c11*c12*c13*c14*c15*c16*c17*c18*c19*c20
73 /rts=81 ;
74 run ;

NOTE: There were 3 observations read from the data set WORK.TEST.
NOTE: PROCEDURE TABULATE used (Total process time):
real time 0.32 seconds
cpu time 0.06 seconds

75 ods html close;

[/pre]

Peter_C · Posted 07-05-2010 06:56 PM

Thank you Cynthia

very odd layout

moving the comma along to between c19 and c20 creates a much slower run time - back to the 4+minutes level

there used to be a tabulate procedure option defaulting to DEPTH=10, which may have been part of the issue about cross-classing 10 class-variables.

I'm going to take some time to explain to myself what is going on with that odd layout...

later
peter

peatjohnston · Posted 06-30-2010 07:24 AM

Hello,

My coding is crude, but this is how I would address your question:

data orig;
input ID date brand;
datalines;
1 1 1
1 1 2
1 1 3
1 2 1
2 1 1
2 1 2
2 2 1
2 2 1
3 1 1
3 2 2
3 3 1
4 1 1
4 2 2
4 2 1
;
proc sort data=orig out=new1 nodup;
by id date brand;
run;
proc means data=new1 noprint;
by ID date;
var brand;
output out=new n=brands;
run;
data joined; merge orig new;
by ID date;
run;
proc print data=joined;
run;

you can use the nodup statement in proc sort to strip out duplicate records by data and id and a new data set (new1) is created. The means procedure is then used to count the unique brands by id and date and creates another data set (new) with the variable 'brands' representing the # of different brands that were purchased by id and date. Then you merge the original data set with the one created in the means procedure.

There may be an easier way to do this, but this is a method I have been using for years. If someone knows a more elegant way, I would be interested to learn.

peat

deleted_user · Posted 06-30-2010 11:55 AM

thank you so much!!

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away