DATA Step, Macro, Functions and more

PROC FREQ with large crosstabulation - how do I increase processing time?

Reply
Frequent Contributor
Posts: 141

PROC FREQ with large crosstabulation - how do I increase processing time?

Hi,

 

I have a proc frequency statement with multiple variables for cross tablulation.

 

Example: 

 

proc freq data=abc.mydataset;
format _CHAR_ $missfmt.;
format _NUMERIC_ missfmt.;
tables var_1*(var_2 date_1 num_1 var_3 num_2 date_2 num_3) / NOCUM NOPERCENT MISSING;
run;

 

This particular dataset has over 19,000 observations.  It took 1 hour to run the freq statement on my local machine. 

 

Any advice on ways to speed this up?  I know it is large.

PROC Star
Posts: 7,492

Re: PROC FREQ with large crosstabulation - how do I increase processing time?

Posted in reply to elwayfan446

Since 19,000 is an extremely small dataset for SAS, I have to think that some or all of your variables have unique values, resulting in tables that would be difficult to use (e.g., at worst case, trying to create 19,000 by 19,000 matrices for each combination.

 

If that is so, and you don't really need the actual comparison of values, one way to speed things up would be to create and include formats that grouped the results.

 

However, difficult to know without seeing the dataset and knowing what you are trying to accomplish.

 

Art, CEO, AnalystFinder.com

 

Frequent Contributor
Posts: 141

Re: PROC FREQ with large crosstabulation - how do I increase processing time?

Yes, you are correct. A lot of these variables do have unique values. I may try the grouping if all else fails. I am just trying to keep the code to a minimum.
Super User
Posts: 11,343

Re: PROC FREQ with large crosstabulation - how do I increase processing time?

Posted in reply to elwayfan446

It is somewhat likely that a large amount of the time is spent in formatting the output. 100 row values* 100 column values = 10,000 cells and that could be for just two variables.

You might try

proc freq data=abc.mydataset;
format _CHAR_ $missfmt.;
format _NUMERIC_ missfmt.;
tables var_1*(var_2 date_1 num_1 var_3 num_2 date_2 num_3) / NOCUM NOPERCENT MISSING list;
run;

 

The output will display as

var_1 value1 , var_2 value 1  counts and percents

var_1 value1 , var_2 value 2  counts and percents

 

Which usually takes less time.

 

If you are looking for only those cases where something is missing you might add a where statement:

where missing(var_1) or missing(var_2) or missing(date_1) <etc.>;

to reduce the number of records.

 

 

Frequent Contributor
Posts: 141

Re: PROC FREQ with large crosstabulation - how do I increase processing time?

The first thing I tried was the LIST option as you suggested.  This worked within 20 seconds.  This leads me to believe you guys are all correct that the formatting is what is bogging it down. 

Regular Contributor
Posts: 222

Re: PROC FREQ with large crosstabulation - how do I increase processing time?

Posted in reply to elwayfan446

One other thing you can try, run the forllowing

Proc Options group=performance; run;

within your SAS session to determine what settings you have inplace, then try to tweak them if you can. I'm talking about

- memsize=

- sortsize=

- sumsize=

- threads/nothreads

 

Changing the values of these options can yield better performance. The other SAS options you should have are Fullstimer, and msglevel

options fullstimer msglevel=i;

 

They will inform you of how your SAS system utlizing the available/allocated resources.

 

Hope this helps,

Ahmed

Trusted Advisor
Posts: 1,022

Re: PROC FREQ with large crosstabulation - how do I increase processing time?

[ Edited ]
Posted in reply to elwayfan446

As @art297 and @ballardw say, it probably takes a lot of time because the cardinality of the crosstabulated variables  is high. 

 

However, I think most of the clock time is not in the tabulation, but in the development of the results file.   In the 1,000 by 1,000 frequency below, note that with NOPRINT, the proc freq took only .81 secs, while with printing of a straightforward LISTING results was 4 time that: 3.32 sec's,  When I enabled HTML results, the process took 2:40 (160 seconds, a ratio of about 200).  (Editted additional note: it took about another 2 minutes for pc-sas to display the html result).

 

My suggestion.  If it's too soon to reduce cardinality per Art's suggestion, (1) run the program with NOPRINT (be sure to use the OUT= option on your tables statements), or (2) make sure only LISTING is enabled for results.

 

 

5441   data test;
5442     do a=1 to 1000; do b=1 to 1000; output; end; end;
5443   run;

NOTE: The data set WORK.TEST has 1000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds


5444   proc freq data=test noprint;
5445     tables a*b / out=a_by_b;
5446   run;

NOTE: There were 1000000 observations read from the data set WORK.TEST.
NOTE: The data set WORK.A_BY_B has 1000000 observations and 4 variables.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           0.81 seconds
      cpu time            0.81 seconds


5447   proc freq data=test ;
5448     tables a * b / out=a_by_b;
5449   run;

NOTE: There were 1000000 observations read from the data set WORK.TEST.
NOTE: The data set WORK.A_BY_B has 1000000 observations and 4 variables.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           3.32 seconds
      cpu time            3.32 seconds

 

Frequent Contributor
Posts: 141

Re: PROC FREQ with large crosstabulation - how do I increase processing time?

Thanks everyone.  I am going to try these suggestions and get back with you.

 

@mkeintz, I agree with you... I think creating the file may be the issue as well.  In EG, I have the program task set to send the results to a PDF file.  My understanding is that, in EG, this is the equivalent of the ODS statement to export to PDF.  Is that correct? Would your suggestion work in EG programs?

Trusted Advisor
Posts: 1,022

Re: PROC FREQ with large crosstabulation - how do I increase processing time?

Posted in reply to elwayfan446

I don't use EG, but I have a copy, so I fired it up.  It has a TOOLS/OPTIONS window, which has a check box for "text output".  Check thant and uncheck the "SAS Report" box.  That's probably the equivalent to my suggestion.

 

SAS Super FREQ
Posts: 3,756

Re: PROC FREQ with large crosstabulation - how do I increase processing time?

[ Edited ]
Posted in reply to elwayfan446

Another option to speed things up: Use the NOPRINT option and write the output to a data set

 

proc freq noprint;

tables x y z / out=FreqOut;

run;

 

EDIT: Repeat of 

Regular Contributor
Posts: 227

Re: PROC FREQ with large crosstabulation - how do I increase processing time?

Posted in reply to elwayfan446

The problem is that SAS is building a large sparse matrix for the results.

For your case where lower vars are exclusive this means using more memory than needed.

 

This paper discusses how to concatenate all vars in the x-tab into one character variable

to reduce time.

 

Summarizing Impossibly Large SAS Data Sets for the Data Warehouse Server Using Horizontal Summarization

 

Michael A. Raithel

 

SUGI 2000


http://www2.sas.com/proceedings/sugi25/25/sy/25p290.pdf

Ask a Question
Discussion stats
  • 10 replies
  • 301 views
  • 2 likes
  • 7 in conversation