BookmarkSubscribeRSS Feed
Uknown_user
Quartz | Level 8

Hi, I encountered problem when running proc freq. I have huge table which I splitted into few subsets with sql query and looped through them. The problem occurs when the object (subset) with 22m+ lines and 105 variables is processed. I tried to further break it column-wise with another loop and apply the proc freq on the smaller bits with no sucess.

 

Here is the screen of the error:

 

error.PNG

 

Perhaps, the code it self might be simplified significantly. Here below is the code which fails to produce desired output due to insufficient memory (please note that the first two tabs are created just for demonstration purposes):

 

data loop;
do i = 1 to 1000;
output;
end;
run;

data demonstrate;
set loop;
if mod(i,2) = 0 then _b = 'value1';
else _b = 'value2';
_x = floorz(abs(rand("NORMAL")*2))+1;
if _x > length(_b) then _x = length(_b)-1;
if _x = 0 then _x = 1;
_y = floorz(abs(rand("NORMAL")*2))+1;
if _y + _x > length(_b) then _y = 1;
_c = substr(_b,_y,_x);
d = cats(of _:);
run;

proc contents
	data=demonstrate Position noprint
	out=out1	 (keep =	LIBNAME
							MEMNAME
							NAME
							TYPE
							LENGTH
							FORMAT
							FORMATL
							INFORMAT
							ENGINE
							MEMTYPE);
run;

data out2;
retain	LIBNAME
		MEMNAME
		NAME
		TYPE
		LENGTH
		FORMAT
		FORMATL
		INFORMAT
		ENGINE
		MEMTYPE;		
set out1;
run;

proc format;
	value $ missfmt ' ' = "Missing" other = "Not_Missing";
	value nmissfmt . = "Missing" other = "Not_Missing";
run;

ods table onewayfreqs=out3;
proc freq
	data=demonstrate;
	table _all_ / missing;
	format _numeric_ nmissfmt. _character_ $missfmt.;
run;

ods table onewayfreqs=out4;
proc freq data=demonstrate order=freq;
	table _all_ / list nopercent nocum;
run;

proc sort data=out4;
by table;
run;

data out5;
set out4 (keep=table frequency f_:);
table = substr(table,7,length(table)-6);
by table;
if first.table then count = 0;
count + 1;
if count le 5 then output;
drop count;
run;

data out6 (drop=F_:);
set out5;
variable = cats(of F_:);
run;

proc transpose data=out6 out=out7 (drop=_name_) prefix=freq;
by table;
run;

proc transpose data=out6 (keep=table variable) out=out8 (drop=_name_) prefix=val;
by Table;
var variable;
run;

data out9 (rename=(table=name));
retain table val1 val2 val3 val4 val5 freq1 freq2 freq3 freq4 freq5;
merge
out7 (in=a)
out8  (in=b);
by table;
if a;
run;

data out10;
	length variable $32. variable_value $50.;
	set out3;
	Variable=scan(table, 2);
	Variable_Value=strip(trim(vvaluex(variable)));
	presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")");
	keep variable variable_value frequency percent cum: presentation;
	label variable='Variable' variable_value='Variable Value';
run;

proc sort data=out10;
	by variable;
run;

proc transpose data=out10 out=out11;
	by variable;
	id variable_value;
	var presentation;
run;

data out12;
merge
out2  (in=a)
out11 (rename=(variable=name) drop=_NAME_ in=b);
by NAME;
if a;
run;

data out13;
length name $300.;
merge
out12 (in=a)
out9 (in=b);
by name;
if a;
run;

My aim is to have the data structure as can be seen in dataset called out13.

 

Any suggestion how to simplify and rewrite the script so that it worked and provided me with desired output? Thanks

35 REPLIES 35
Reeza
Super User

The code above produces the error in your post?

Uknown_user
Quartz | Level 8

No that is just a code that demonstrates desired output, I cannot post the original data here. The original dataset has 100m+ rows.

ballardw
Super User

Can you describe in narrative some of what you are attempting?

 

I have no clue what the purpose of your Val1 - Val5 variables might be.

 

Are you trying to find the five most common values for all variables in a dataset??

 

 

 

 

Uknown_user
Quartz | Level 8

Yes, I want to find 5 most common values and their occurances as well as the basic description of the data set (i.e. location, name of table, all variables, etc.). Thanks

Shmuel
Garnet | Level 18

The error message mentions file work.copy_1.

This dataset name does not exist in the code you have posted.

 

Getting insufficient memory is usually the consequence of too many values to check

(maybe each value has frequency of 1) .

Uknown_user
Quartz | Level 8

I removed all of the identifiers (i.e. unique values) so there should no longer be variables that make no sense to find 5 most common values.

Uknown_user
Quartz | Level 8

Do you know if the memory is cleaned within each run of loop? I tried to clean workspace in each run of the loop but still it does not help and it crashes after processing ~ 1/2 of the data from table that has 22m+ rows and 105 cols. Is there anyway how to clean the memory but store information about the nested loops (i.e. in which run they were)?

Shmuel
Garnet | Level 18

Please post the full log, including the code (use options source source2 mprint;)  - of the crashing step.

 

Is the input dataset, to that step, sorted ? if yes - by what key variables ?

 

Uknown_user
Quartz | Level 8

The input table is not sorted. However, it returned reasonable error proving that there was left one variable that was nearly unique across lines which in fact probably caused the error. I removed it and now the script appears to be working. Do you have any suggestion how to simplify the posted code? It should be way more straightforward to get the structure of data as can ben seen in dataset out13?

Shmuel
Garnet | Level 18

I'm trying to summarize situation up to now:

1) You overcome the problem of INSUFFICIENT SPACE. 

    

2) I quote from your post to @ballardw :

    

Yes, I want to find 5 most common values and their occurances as well as the basic description of the data set (i.e. location, name of table, all variables, etc.)

 

3) It is difficult to me to follow your code - what was the intention of each step.

     Please do:

    - post a sample (few lines with most relevant columns) of the input dataset to begin from.

      Is it the output of PROC FREQ that caused the crash ?

    - Post your logic to create the desired report - one statement per step.

    - Post the format of the desired report. Concentrate on the relevant columns you show in the sample.

 

Doing that, you may come yourself to the solution how to simplify your code,

otherwise, we are here to help you.

Uknown_user
Quartz | Level 8

Ad 1) Yes, the problem with insufficient memory has been overcome.

Ad 2) My aim is to have basic description of the file, i.e. location of the table (libname), name of table (memname), list of variables and all other characteristics of given dataset (format, informat, length, etc.) + 5 most common values and the number of their occurances.

Ad 3) The format of the output is given by dataset out13:

Capture.PNGCapture1.PNG

 

I think it should be possible to get this format with out so many formatting steps (as you can see in the posted code).

I need to analyze all of the variables (5 most common values and their occurances) except these that are unique. The crash occured while processing unique/nearly unique (for e.g. cellphone numbers) values in given column, after removing them, it worked.

Uknown_user
Quartz | Level 8

I would need to understand how the memory is allocated for SAS script execution. As I wrote within original post, the loop goes through one variable (country) that splits the huge table into bits. Afterwards, the proc freq are executed for each single subset. It worked (even for the greatest subset) after removing unique values when I ran it for a single country, however when I tried to run it whole (to go through all countries within given loop), it failed again reporting same error as posted originally.

 

Any suggestion how to overcome this problem? Thanks

Shmuel
Garnet | Level 18
You have not posted sample of freq-output with relevant variables as I asked. Is country one of them ?
Uknown_user
Quartz | Level 8

Within the final (formatted output) there are 105 lines (variables) and country is one of them.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 35 replies
  • 2006 views
  • 10 likes
  • 7 in conversation