Venkat4
Quartz | Level 8

This is a challenge for me, especially with insufficient-memory errors when running PROC FREQ or PROC SQL counts. I split the text fields into individual words and have just one column, called "word". It had 140 million words, so I split the data into three datasets (50 + 50 + 40 million). But I still can't run PROC FREQ, even with the order=freq option.

Code is like below...

proc freq data=inputdata order=freq;
  tables word / noprint nocum nopercent out=outputdata;
run;

 

I get the below...

ERROR: The SAS System stopped processing this step because of insufficient memory.

NOTE: There were 3347033 observations read from the data set inputdata.

 

I wanted to get counts of each word and sort them in descending order.

I would be fine with just the top 500 words with their counts, or the top 100 words with their frequencies. I tried nocum, nopercent, etc. to avoid a larger dataset, but I still get a lot of memory-related errors. I even tried sending the results to an OUT= dataset, but similar memory issues show up. Please suggest whether I can use PROC IML or another procedure that uses less memory.

 

 

1 ACCEPTED SOLUTION

mkeintz
PROC Star

If @ballardw's suggestion doesn't work, you could sort the dataset by word and then create a dataset with the word value and word frequency, using a data step or a proc with a "by word" statement. You would then have a dataset with two variables, word and word_freq:

 

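/* Sort so that identical words are adjacent */
proc sort data=inputdata out=need;
  by word;
run;

/* One output row per word: the loop reads every row of a BY group,
   and word_freq ends up as the number of rows read */
data outputdata;
  do word_freq=1 by 1 until (last.word);
    set need;
    by word;
  end;
run;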

It might take a while to do the sort, but it is a simple program.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------


17 REPLIES
ballardw
Super User

Add the option NOPRINT to the proc freq statement and see if that helps.

 

Most of the "memory" used was likely trying to create a results table with millions of rows.

Venkat4
Quartz | Level 8

Thanks, even with the noprint option I got memory issues....and it stopped right after 5mil obs

 

ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: There were 5242529 observations read from the data set xxxxx

Tom
Super User

@Venkat4 wrote:

Thanks, even with the noprint option I got memory issues....and it stopped right after 5mil obs

 

ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: There were 5242529 observations read from the data set xxxxx


I am sure PROC FREQ is trying to build some type of table of all of the values it finds in the dataset (probably a hash table).  So if you have 5 million different values of WORD then it needs at least 5 million * (length(word) +8 bytes for the count variable) bytes of memory.  If WORD is very long that quickly works out to a lot of memory.
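
To put rough numbers on that estimate, here is a minimal sketch. The 5 million distinct values is the figure used in the paragraph above; the 10000-byte storage length is only an assumption (the original poster reports that value later in this thread), so substitute the real length shown by PROC CONTENTS.

```sas
/* Back-of-the-envelope estimate only; both inputs are assumptions */
data _null_;
  n_distinct = 5e6;     /* assumed number of distinct values of WORD   */
  word_len   = 10000;   /* assumed storage length of WORD, in bytes    */
  bytes = n_distinct * (word_len + 8);   /* +8 bytes for the count     */
  gb    = bytes / 1024**3;
  put gb= 8.1;          /* writes the estimate, in GB, to the log      */
run;
```

With these assumed inputs the result is roughly 47 GB, which shows how the storage length of WORD, rather than the number of rows, can dominate the memory requirement.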

ballardw
Super User

@Venkat4 wrote:

Thanks, even with the noprint option I got memory issues....and it stopped right after 5mil obs

 

ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: There were 5242529 observations read from the data set xxxxx


Which version of SAS are you running? If you are connecting to a server, you may be running into resource limits (memory or disk space) set by your admin.

Venkat4
Quartz | Level 8
It is SAS 9.4 M5 running as a client-server via EG 7.15
Tom
Super User

If you just want to count how many times each value appears (and don't need to generate any statistics), then there is no need to use PROC FREQ. You could count with a simple SQL statement.

proc sql;
create table outputdata as
  select word,count(*) as count
  from inputdata 
  group by word
;
quit;
Venkat4
Quartz | Level 8

Thanks, this is what I tried in the beginning. It stops right away due to memory issues. I am using the grid, which has a 500GB limit for each user, and the job is automatically shut down because it takes more than 500GB, even though the dataset itself is only 2GB. I think it is using some kind of intermediate table for calculations, and that takes too much space whether I use the work library or a permanent library.

mkeintz
PROC Star

@Venkat4 wrote:

Thanks, this is what I tried in the beginning. It stops right away due to memory issues. I am using the grid, which has a 500GB limit for each user, and the job is automatically shut down because it takes more than 500GB, even though the dataset itself is only 2GB. I think it is using some kind of intermediate table for calculations, and that takes too much space whether I use the work library or a permanent library.


I think you are confusing memory limits (which is what your error messages are referring to) with disk storage limit (which is almost certainly the 500GB you refer to).  If the SQL is actually reporting termination due to memory limits, it's not the 500GB - it's some other limit.

To find out how much memory your SAS program typically has available, run this to get a report:

proc options option=memsize;
run;

Then you can check with your sysadmins to find out the maximum amount of RAM you can request when starting a SAS session, and how to do it in your environment.
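
A few related ways to inspect the setting, as a hedged sketch rather than a recipe for any particular site:

```sas
/* Show the current MEMSIZE value plus where it was set (VALUE option) */
proc options option=memsize value;
run;

/* List the memory-related system options in one report */
proc options group=memory;
run;

/* Write the current MEMSIZE value to the log */
%put NOTE: MEMSIZE=%sysfunc(getoption(memsize));
```

MEMSIZE can only be set when SAS starts (on the command line or in a configuration file, for example `-memsize 8G`, where 8G is just an illustrative value), so in a grid or Enterprise Guide setup the change usually has to go through the administrator.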

Venkat4
Quartz | Level 8
Understood, thank you.
Venkat4
Quartz | Level 8
Thank you very much. This worked on the three-way split datasets fairly quickly (15-20 minutes each) and didn't throw any errors. I am now working on combining them and getting the final counts.
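
One possible way to do the combining step, sketched under assumptions: counts1, counts2 and counts3 are hypothetical names for the three per-split outputs of the sort-and-count program, each still in word order with the variables word and word_freq. The names combined, total_freq and top500 are also illustrative.

```sas
/* Interleave the three count datasets by word and add up the partial counts */
data combined;
  set counts1 counts2 counts3;
  by word;                        /* requires each input to be in word order */
  if first.word then total_freq = 0;
  total_freq + word_freq;         /* sum statement: value is retained across rows */
  if last.word then output;       /* one row per distinct word */
  keep word total_freq;
run;

/* Most frequent words first */
proc sort data=combined;
  by descending total_freq;
run;

/* Keep only the top 500 words */
data top500;
  set combined(obs=500);
run;
```

Because the interleave reads the inputs sequentially, it should need very little memory; the only sort left is on the much smaller one-row-per-word dataset.
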
Venkat4
Quartz | Level 8
It assigned a length of 10000 to word, not sure why; I don't have any word that long. I see that this could be the reason the whole thing is blowing up. Thanks.
Kurt_Bremser
Super User

Since any utility file is uncompressed (and the in-memory structures of procedures like FREQ also), this will invariably cause problems.

Check for max(length(word)), and adjust your dataset accordingly. Also consider converting all words to lowercase, to avoid differences caused by words starting a sentence. That reduces the size of the in-memory tables.

 

After all, in everyday language the number of words actually used is a few thousand, so this should not cause any memory problems.

The Oxford Dictionary of English contains 355000 words, so even with a word length of 100 you'd need just 40 MB (50 including a search tree) for the freq table.
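
A minimal sketch of the length check and adjustment suggested above. It assumes the dataset inputdata and the variable word from the original question; the macro variable maxlen, the output dataset slim and the variable word_lc are hypothetical names chosen for illustration.

```sas
/* Find the longest value actually stored in WORD */
proc sql noprint;
  select max(length(word)) into :maxlen trimmed
  from inputdata;
quit;

%put NOTE: longest word has &maxlen characters;

/* Rebuild the data with a right-sized, lowercased word variable */
data slim;
  length word_lc $&maxlen;        /* declare the short length before SET */
  set inputdata(keep=word);
  word_lc = lowcase(word);        /* merge "The" and "the" into one value */
  keep word_lc;
run;
```

Running PROC FREQ or the SQL count against slim instead of inputdata should then work with a much smaller in-memory table, since each distinct value takes at most maxlen bytes rather than the full declared length.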
