This is a challenge for me, especially because I run out of memory when running PROC FREQ or PROC SQL counts. I split the text fields into individual words, so I have just one column called "word". It had 140 million words, so I split the data into three datasets of 50 + 50 + 40 million. But I still can't run PROC FREQ, even with the ORDER=FREQ option.
Code is like below...
proc freq data=inputdata order=freq;
tables word / noprint nocum nopercent out=outputdata;
run;
I get the below...
ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: There were 3347033 observations read from the data set inputdata.
I want to get the count of each word and sort the counts in descending order.
I would be fine with just the top 500 words with their counts, or the top 100 words with their frequencies. I tried NOCUM, NOPERCENT, etc. to avoid a larger dataset, but I still get a lot of memory-related errors. Sending the results to an OUT= dataset produces similar memory issues. Please suggest whether I can use PROC IML or another procedure that uses less memory.
Accepted Solutions
If @ballardw's suggestion doesn't work, you could sort the dataset by word and then create a dataset with the word value and word frequency, using a data step or a proc with a "by word" statement. You would then have a dataset with two vars, word and word_freq:
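/* Sort so that identical words are adjacent */
proc sort data=inputdata out=need;
  by word;
run;
/* One loop per BY group: word_freq ends up as the number of rows read for the word */
data outputdata;
  do word_freq=1 by 1 until (last.word);
    set need;
    by word;
  end;
  /* the implicit OUTPUT here writes one row per word */
run;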
It might take a while to do the sort, but it is a simple program.
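The original question asked for the top 500 words in descending order of count. A minimal sketch of that last step, assuming outputdata was created by the program above; the dataset names ranked and top500 are hypothetical:

/* Order the per-word counts from most to least frequent */
proc sort data=outputdata out=ranked;
  by descending word_freq;
run;

/* Keep only the first 500 rows, i.e. the 500 most frequent words */
data top500;
  set ranked(obs=500);
run;

This sort runs on one row per distinct word rather than on all 140 million rows, so it should be much cheaper than the first sort.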
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
Add the option NOPRINT to the proc freq statement and see if that helps.
Most of the "memory" used was likely trying to create a results table with millions of rows.
Thanks, even with the noprint option I got memory issues....and it stopped right after 5mil obs
ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: There were 5242529 observations read from the data set xxxxx
@Venkat4 wrote:
Thanks, even with the noprint option I got memory issues....and it stopped right after 5mil obs
ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: There were 5242529 observations read from the data set xxxxx
I am sure PROC FREQ is trying to build some type of table of all the values it finds in the dataset (probably a hash table). So if you have 5 million different values of WORD, it needs at least 5 million * (length(word) + 8) bytes of memory, where the 8 bytes hold the count variable. If WORD is very long, that quickly works out to a lot of memory.
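The estimate above can be worked through with a short program. This is only a sketch of the arithmetic: it assumes the dataset is WORK.INPUTDATA, that each table entry costs the defined length of WORD plus 8 bytes as described above, and that there are 5 million distinct words (a guess taken from the post, not a measured value).

/* Look up the defined length of WORD in the dataset metadata */
proc sql noprint;
  select length into :word_len trimmed
  from dictionary.columns
  where libname = 'WORK' and memname = 'INPUTDATA' and upcase(name) = 'WORD';
quit;

/* Multiply out the estimate and write it to the log */
data _null_;
  n_distinct = 5e6;                      /* assumed number of distinct words */
  bytes = n_distinct * (&word_len + 8);  /* defined length plus 8-byte count */
  put 'Estimated table size: ' bytes comma20. ' bytes';
run;

For example, a defined length of 200 would give 5,000,000 * 208 = 1,040,000,000 bytes, roughly 1 GB.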
@Venkat4 wrote:
Thanks, even with the noprint option I got memory issues....and it stopped right after 5mil obs
ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: There were 5242529 observations read from the data set xxxxx
Which version of SAS are you running? If you are connecting to a server, you may be running into memory or disk-space limits set by your admin.
If you just want to count how many times each value appears (and don't need to generate any statistics), then there is no need to use PROC FREQ. You could count with a simple SQL statement.
proc sql;
create table outputdata as
select word,count(*) as count
from inputdata
group by word
;
quit;
Thanks, this is what I tried in the beginning. It stops right away due to memory issues. I am using the grid, which has a 500GB limit for each user, and the job shuts down automatically because it takes more than 500GB, even though the dataset itself is only 2GB. I think it is using some kind of intermediate table for calculations, and that takes too much space whether I use the work library or a permanent library.
@Venkat4 wrote:
Thanks, this is what I tried in the beginning. It stops right away due to memory issues. I am using the grid, which has a 500GB limit for each user, and the job shuts down automatically because it takes more than 500GB, even though the dataset itself is only 2GB. I think it is using some kind of intermediate table for calculations, and that takes too much space whether I use the work library or a permanent library.
I think you are confusing memory limits (which is what your error messages are referring to) with disk storage limit (which is almost certainly the 500GB you refer to). If the SQL is actually reporting termination due to memory limits, it's not the 500GB - it's some other limit.
To find out how much memory your SAS program typically has available, run this to get a report:
proc options option=memsize;
run;
Then you can check with your sysadmin to find out the maximum amount of RAM you can request when starting a SAS session, and how to do it in your environment.
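As a small addition to the report above, the same setting can be written straight to the log with the GETOPTION function. This is only a sketch; the 4G value in the comment is a hypothetical example, and the limit you are actually allowed is something your sysadmin has to confirm.

/* Write the current MEMSIZE setting to the log */
%put MEMSIZE = %sysfunc(getoption(memsize));

/* MEMSIZE can only be set when SAS starts, for example on the
   command line or in a configuration file:
     sas -memsize 4G
   It cannot be changed with an OPTIONS statement inside a running session. */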
What is the defined length of your "word" variable, and how long is the longest value in it?
Since utility files are always uncompressed (as are the in-memory structures of procedures like FREQ), this will invariably cause problems.
Check for max(length(word)), and adjust your dataset accordingly. Also consider converting all words to lowercase, to avoid differences caused by words starting a sentence. That reduces the size of the in-memory tables.
After all, in everyday language the number of words actually used is a few thousand, so this should not cause any memory problems.
The Oxford Dictionary of English contains 355,000 words, so even with a word length of 100 you'd need just 40 MB (50 including a search tree) for the freq table.
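The advice above (check max(length(word)), shrink the variable, and lowercase the values) could be sketched as two data steps. The names words_small and word_lc and the macro variable maxlen are hypothetical, and the sketch assumes the input dataset is called inputdata as in the original question.

/* Pass 1: find the longest actual value; only one number is held in memory */
data _null_;
  set inputdata(keep=word) end=eof;
  retain maxlen 0;
  maxlen = max(maxlen, lengthn(word));
  /* use at least 1 so the LENGTH statement below stays valid */
  if eof then call symputx('maxlen', max(maxlen, 1));
run;

/* Pass 2: rebuild the data with a right-sized, lowercased variable */
data words_small;
  length word_lc $ &maxlen;
  set inputdata(keep=word);
  word_lc = lowcase(word);
  drop word;
run;

The counting step (PROC FREQ, PROC SQL, or the sort and data step approach) can then be run on words_small using word_lc, which should need far less memory per distinct value if the original variable was defined much longer than its contents.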