Hi, I encountered problem when running proc freq. I have huge table which I splitted into few subsets with sql query and looped through them. The problem occurs when the object (subset) with 22m+ lines and 105 variables is processed. I tried to further break it column-wise with another loop and apply the proc freq on the smaller bits with no sucess.
Here is the screen of the error:
Perhaps, the code it self might be simplified significantly. Here below is the code which fails to produce desired output due to insufficient memory (please note that the first two tabs are created just for demonstration purposes):
data loop;
do i = 1 to 1000;
output;
end;
run;
data demonstrate;
set loop;
if mod(i,2) = 0 then _b = 'value1';
else _b = 'value2';
_x = floorz(abs(rand("NORMAL")*2))+1;
if _x > length(_b) then _x = length(_b)-1;
if _x = 0 then _x = 1;
_y = floorz(abs(rand("NORMAL")*2))+1;
if _y + _x > length(_b) then _y = 1;
_c = substr(_b,_y,_x);
d = cats(of _:);
run;
proc contents
	data=demonstrate Position noprint
	out=out1	 (keep =	LIBNAME
							MEMNAME
							NAME
							TYPE
							LENGTH
							FORMAT
							FORMATL
							INFORMAT
							ENGINE
							MEMTYPE);
run;
data out2;
retain	LIBNAME
		MEMNAME
		NAME
		TYPE
		LENGTH
		FORMAT
		FORMATL
		INFORMAT
		ENGINE
		MEMTYPE;		
set out1;
run;
proc format;
	value $ missfmt ' ' = "Missing" other = "Not_Missing";
	value nmissfmt . = "Missing" other = "Not_Missing";
run;
ods table onewayfreqs=out3;
proc freq
	data=demonstrate;
	table _all_ / missing;
	format _numeric_ nmissfmt. _character_ $missfmt.;
run;
ods table onewayfreqs=out4;
proc freq data=demonstrate order=freq;
	table _all_ / list nopercent nocum;
run;
proc sort data=out4;
by table;
run;
data out5;
set out4 (keep=table frequency f_:);
table = substr(table,7,length(table)-6);
by table;
if first.table then count = 0;
count + 1;
if count le 5 then output;
drop count;
run;
data out6 (drop=F_:);
set out5;
variable = cats(of F_:);
run;
proc transpose data=out6 out=out7 (drop=_name_) prefix=freq;
by table;
run;
proc transpose data=out6 (keep=table variable) out=out8 (drop=_name_) prefix=val;
by Table;
var variable;
run;
data out9 (rename=(table=name));
retain table val1 val2 val3 val4 val5 freq1 freq2 freq3 freq4 freq5;
merge
out7 (in=a)
out8  (in=b);
by table;
if a;
run;
data out10;
	length variable $32. variable_value $50.;
	set out3;
	Variable=scan(table, 2);
	Variable_Value=strip(trim(vvaluex(variable)));
	presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")");
	keep variable variable_value frequency percent cum: presentation;
	label variable='Variable' variable_value='Variable Value';
run;
proc sort data=out10;
	by variable;
run;
proc transpose data=out10 out=out11;
	by variable;
	id variable_value;
	var presentation;
run;
data out12;
merge
out2  (in=a)
out11 (rename=(variable=name) drop=_NAME_ in=b);
by NAME;
if a;
run;
data out13;
length name $300.;
merge
out12 (in=a)
out9 (in=b);
by name;
if a;
run;My aim is to have the data structure as can be seen in dataset called out13.
Any suggestion how to simplify and rewrite the script so that it worked and provided me with desired output? Thanks
The code above produces the error in your post?
No that is just a code that demonstrates desired output, I cannot post the original data here. The original dataset has 100m+ rows.
Can you describe in narrative some of what you are attempting?
I have no clue what the purpose of your Val1 - Val5 variables might be.
Are you trying to find the five most common values for all variables in a dataset??
Yes, I want to find 5 most common values and their occurances as well as the basic description of the data set (i.e. location, name of table, all variables, etc.). Thanks
The error message mentions file work.copy_1.
This dataset name does not exist in the code you have posted.
Getting insufficient memory is usually the consequence of too many values to check
(maybe each value has frequency of 1) .
I removed all of the identifiers (i.e. unique values) so there should no longer be variables that make no sense to find 5 most common values.
Do you know if the memory is cleaned within each run of loop? I tried to clean workspace in each run of the loop but still it does not help and it crashes after processing ~ 1/2 of the data from table that has 22m+ rows and 105 cols. Is there anyway how to clean the memory but store information about the nested loops (i.e. in which run they were)?
Please post the full log, including the code (use options source source2 mprint;) - of the crashing step.
Is the input dataset, to that step, sorted ? if yes - by what key variables ?
The input table is not sorted. However, it returned reasonable error proving that there was left one variable that was nearly unique across lines which in fact probably caused the error. I removed it and now the script appears to be working. Do you have any suggestion how to simplify the posted code? It should be way more straightforward to get the structure of data as can ben seen in dataset out13?
I'm trying to summarize situation up to now:
1) You overcome the problem of INSUFFICIENT SPACE.
2) I quote from your post to @ballardw :
Yes, I want to find 5 most common values and their occurances as well as the basic description of the data set (i.e. location, name of table, all variables, etc.)
3) It is difficult to me to follow your code - what was the intention of each step.
Please do:
- post a sample (few lines with most relevant columns) of the input dataset to begin from.
Is it the output of PROC FREQ that caused the crash ?
- Post your logic to create the desired report - one statement per step.
- Post the format of the desired report. Concentrate on the relevant columns you show in the sample.
Doing that, you may come yourself to the solution how to simplify your code,
otherwise, we are here to help you.
Ad 1) Yes, the problem with insufficient memory has been overcome.
Ad 2) My aim is to have basic description of the file, i.e. location of the table (libname), name of table (memname), list of variables and all other characteristics of given dataset (format, informat, length, etc.) + 5 most common values and the number of their occurances.
Ad 3) The format of the output is given by dataset out13:
I think it should be possible to get this format with out so many formatting steps (as you can see in the posted code).
I need to analyze all of the variables (5 most common values and their occurances) except these that are unique. The crash occured while processing unique/nearly unique (for e.g. cellphone numbers) values in given column, after removing them, it worked.
I would need to understand how the memory is allocated for SAS script execution. As I wrote within original post, the loop goes through one variable (country) that splits the huge table into bits. Afterwards, the proc freq are executed for each single subset. It worked (even for the greatest subset) after removing unique values when I ran it for a single country, however when I tried to run it whole (to go through all countries within given loop), it failed again reporting same error as posted originally.
Any suggestion how to overcome this problem? Thanks
Within the final (formatted output) there are 105 lines (variables) and country is one of them.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
