BookmarkSubscribeRSS Feed
elwayfan446
Barite | Level 11

Hi,

 

I have a proc frequency statement with multiple variables for cross tablulation.

 

Example: 

 

proc freq data=abc.mydataset;
format _CHAR_ $missfmt.;
format _NUMERIC_ missfmt.;
tables var_1*(var_2 date_1 num_1 var_3 num_2 date_2 num_3) / NOCUM NOPERCENT MISSING;
run;

 

This particular dataset has over 19,000 observations.  It took 1 hour to run the freq statement on my local machine. 

 

Any advice on ways to speed this up?  I know it is large.

10 REPLIES 10
art297
Opal | Level 21

Since 19,000 is an extremely small dataset for SAS, I have to think that some or all of your variables have unique values, resulting in tables that would be difficult to use (e.g., at worst case, trying to create 19,000 by 19,000 matrices for each combination.

 

If that is so, and you don't really need the actual comparison of values, one way to speed things up would be to create and include formats that grouped the results.

 

However, difficult to know without seeing the dataset and knowing what you are trying to accomplish.

 

Art, CEO, AnalystFinder.com

 

elwayfan446
Barite | Level 11
Yes, you are correct. A lot of these variables do have unique values. I may try the grouping if all else fails. I am just trying to keep the code to a minimum.
ballardw
Super User

It is somewhat likely that a large amount of the time is spent in formatting the output. 100 row values* 100 column values = 10,000 cells and that could be for just two variables.

You might try

proc freq data=abc.mydataset;
format _CHAR_ $missfmt.;
format _NUMERIC_ missfmt.;
tables var_1*(var_2 date_1 num_1 var_3 num_2 date_2 num_3) / NOCUM NOPERCENT MISSING list;
run;

 

The output will display as

var_1 value1 , var_2 value 1  counts and percents

var_1 value1 , var_2 value 2  counts and percents

 

Which usually takes less time.

 

If you are looking for only those cases where something is missing you might add a where statement:

where missing(var_1) or missing(var_2) or missing(date_1) <etc.>;

to reduce the number of records.

 

 

elwayfan446
Barite | Level 11

The first thing I tried was the LIST option as you suggested.  This worked within 20 seconds.  This leads me to believe you guys are all correct that the formatting is what is bogging it down. 

AhmedAl_Attar
Rhodochrosite | Level 12

One other thing you can try, run the forllowing

Proc Options group=performance; run;

within your SAS session to determine what settings you have inplace, then try to tweak them if you can. I'm talking about

- memsize=

- sortsize=

- sumsize=

- threads/nothreads

 

Changing the values of these options can yield better performance. The other SAS options you should have are Fullstimer, and msglevel

options fullstimer msglevel=i;

 

They will inform you of how your SAS system utlizing the available/allocated resources.

 

Hope this helps,

Ahmed

mkeintz
PROC Star

As @art297 and @ballardw say, it probably takes a lot of time because the cardinality of the crosstabulated variables  is high. 

 

However, I think most of the clock time is not in the tabulation, but in the development of the results file.   In the 1,000 by 1,000 frequency below, note that with NOPRINT, the proc freq took only .81 secs, while with printing of a straightforward LISTING results was 4 time that: 3.32 sec's,  When I enabled HTML results, the process took 2:40 (160 seconds, a ratio of about 200).  (Editted additional note: it took about another 2 minutes for pc-sas to display the html result).

 

My suggestion.  If it's too soon to reduce cardinality per Art's suggestion, (1) run the program with NOPRINT (be sure to use the OUT= option on your tables statements), or (2) make sure only LISTING is enabled for results.

 

 

5441   data test;
5442     do a=1 to 1000; do b=1 to 1000; output; end; end;
5443   run;

NOTE: The data set WORK.TEST has 1000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds


5444   proc freq data=test noprint;
5445     tables a*b / out=a_by_b;
5446   run;

NOTE: There were 1000000 observations read from the data set WORK.TEST.
NOTE: The data set WORK.A_BY_B has 1000000 observations and 4 variables.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           0.81 seconds
      cpu time            0.81 seconds


5447   proc freq data=test ;
5448     tables a * b / out=a_by_b;
5449   run;

NOTE: There were 1000000 observations read from the data set WORK.TEST.
NOTE: The data set WORK.A_BY_B has 1000000 observations and 4 variables.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           3.32 seconds
      cpu time            3.32 seconds

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
elwayfan446
Barite | Level 11

Thanks everyone.  I am going to try these suggestions and get back with you.

 

@mkeintz, I agree with you... I think creating the file may be the issue as well.  In EG, I have the program task set to send the results to a PDF file.  My understanding is that, in EG, this is the equivalent of the ODS statement to export to PDF.  Is that correct? Would your suggestion work in EG programs?

mkeintz
PROC Star

I don't use EG, but I have a copy, so I fired it up.  It has a TOOLS/OPTIONS window, which has a check box for "text output".  Check thant and uncheck the "SAS Report" box.  That's probably the equivalent to my suggestion.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Rick_SAS
SAS Super FREQ

Another option to speed things up: Use the NOPRINT option and write the output to a data set

 

proc freq noprint;

tables x y z / out=FreqOut;

run;

 

EDIT: Repeat of 

Ron_MacroMaven
Lapis Lazuli | Level 10

The problem is that SAS is building a large sparse matrix for the results.

For your case where lower vars are exclusive this means using more memory than needed.

 

This paper discusses how to concatenate all vars in the x-tab into one character variable

to reduce time.

 

Summarizing Impossibly Large SAS Data Sets for the Data Warehouse Server Using Horizontal Summarization

 

Michael A. Raithel

 

SUGI 2000


http://www2.sas.com/proceedings/sugi25/25/sy/25p290.pdf

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 2591 views
  • 2 likes
  • 7 in conversation