Hi,
I have a proc frequency statement with multiple variables for cross tablulation.
Example:
proc freq data=abc.mydataset;
format _CHAR_ $missfmt.;
format _NUMERIC_ missfmt.;
tables var_1*(var_2 date_1 num_1 var_3 num_2 date_2 num_3) / NOCUM NOPERCENT MISSING;
run;
This particular dataset has over 19,000 observations. It took 1 hour to run the freq statement on my local machine.
Any advice on ways to speed this up? I know it is large.
Since 19,000 is an extremely small dataset for SAS, I have to think that some or all of your variables have unique values, resulting in tables that would be difficult to use (e.g., at worst case, trying to create 19,000 by 19,000 matrices for each combination.
If that is so, and you don't really need the actual comparison of values, one way to speed things up would be to create and include formats that grouped the results.
However, difficult to know without seeing the dataset and knowing what you are trying to accomplish.
Art, CEO, AnalystFinder.com
It is somewhat likely that a large amount of the time is spent in formatting the output. 100 row values* 100 column values = 10,000 cells and that could be for just two variables.
You might try
proc freq data=abc.mydataset;
format _CHAR_ $missfmt.;
format _NUMERIC_ missfmt.;
tables var_1*(var_2 date_1 num_1 var_3 num_2 date_2 num_3) / NOCUM NOPERCENT MISSING list;
run;
The output will display as
var_1 value1 , var_2 value 1 counts and percents
var_1 value1 , var_2 value 2 counts and percents
Which usually takes less time.
If you are looking for only those cases where something is missing you might add a where statement:
where missing(var_1) or missing(var_2) or missing(date_1) <etc.>;
to reduce the number of records.
The first thing I tried was the LIST option as you suggested. This worked within 20 seconds. This leads me to believe you guys are all correct that the formatting is what is bogging it down.
One other thing you can try, run the forllowing
Proc Options group=performance; run;
within your SAS session to determine what settings you have inplace, then try to tweak them if you can. I'm talking about
- memsize=
- sortsize=
- sumsize=
- threads/nothreads
Changing the values of these options can yield better performance. The other SAS options you should have are Fullstimer, and msglevel
options fullstimer msglevel=i;
They will inform you of how your SAS system utlizing the available/allocated resources.
Hope this helps,
Ahmed
As @art297 and @ballardw say, it probably takes a lot of time because the cardinality of the crosstabulated variables is high.
However, I think most of the clock time is not in the tabulation, but in the development of the results file. In the 1,000 by 1,000 frequency below, note that with NOPRINT, the proc freq took only .81 secs, while with printing of a straightforward LISTING results was 4 time that: 3.32 sec's, When I enabled HTML results, the process took 2:40 (160 seconds, a ratio of about 200). (Editted additional note: it took about another 2 minutes for pc-sas to display the html result).
My suggestion. If it's too soon to reduce cardinality per Art's suggestion, (1) run the program with NOPRINT (be sure to use the OUT= option on your tables statements), or (2) make sure only LISTING is enabled for results.
5441 data test;
5442 do a=1 to 1000; do b=1 to 1000; output; end; end;
5443 run;
NOTE: The data set WORK.TEST has 1000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds
5444 proc freq data=test noprint;
5445 tables a*b / out=a_by_b;
5446 run;
NOTE: There were 1000000 observations read from the data set WORK.TEST.
NOTE: The data set WORK.A_BY_B has 1000000 observations and 4 variables.
NOTE: PROCEDURE FREQ used (Total process time):
real time 0.81 seconds
cpu time 0.81 seconds
5447 proc freq data=test ;
5448 tables a * b / out=a_by_b;
5449 run;
NOTE: There were 1000000 observations read from the data set WORK.TEST.
NOTE: The data set WORK.A_BY_B has 1000000 observations and 4 variables.
NOTE: PROCEDURE FREQ used (Total process time):
real time 3.32 seconds
cpu time 3.32 seconds
Thanks everyone. I am going to try these suggestions and get back with you.
@mkeintz, I agree with you... I think creating the file may be the issue as well. In EG, I have the program task set to send the results to a PDF file. My understanding is that, in EG, this is the equivalent of the ODS statement to export to PDF. Is that correct? Would your suggestion work in EG programs?
I don't use EG, but I have a copy, so I fired it up. It has a TOOLS/OPTIONS window, which has a check box for "text output". Check thant and uncheck the "SAS Report" box. That's probably the equivalent to my suggestion.
Another option to speed things up: Use the NOPRINT option and write the output to a data set
proc freq noprint;
tables x y z / out=FreqOut;
run;
EDIT: Repeat of mkeintz suggestion
The problem is that SAS is building a large sparse matrix for the results.
For your case where lower vars are exclusive this means using more memory than needed.
This paper discusses how to concatenate all vars in the x-tab into one character variable
to reduce time.
Summarizing Impossibly Large SAS Data Sets for the Data Warehouse Server Using Horizontal Summarization
Michael A. Raithel
SUGI 2000
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.