11-02-2017 04:56 PM - edited 11-03-2017 04:08 AM
Hi, I encountered problem when running proc freq. I have huge table which I splitted into few subsets with sql query and looped through them. The problem occurs when the object (subset) with 22m+ lines and 105 variables is processed. I tried to further break it column-wise with another loop and apply the proc freq on the smaller bits with no sucess.
Here is the screen of the error:
Perhaps, the code it self might be simplified significantly. Here below is the code which fails to produce desired output due to insufficient memory (please note that the first two tabs are created just for demonstration purposes):
data loop; do i = 1 to 1000; output; end; run; data demonstrate; set loop; if mod(i,2) = 0 then _b = 'value1'; else _b = 'value2'; _x = floorz(abs(rand("NORMAL")*2))+1; if _x > length(_b) then _x = length(_b)-1; if _x = 0 then _x = 1; _y = floorz(abs(rand("NORMAL")*2))+1; if _y + _x > length(_b) then _y = 1; _c = substr(_b,_y,_x); d = cats(of _:); run; proc contents data=demonstrate Position noprint out=out1 (keep = LIBNAME MEMNAME NAME TYPE LENGTH FORMAT FORMATL INFORMAT ENGINE MEMTYPE); run; data out2; retain LIBNAME MEMNAME NAME TYPE LENGTH FORMAT FORMATL INFORMAT ENGINE MEMTYPE; set out1; run; proc format; value $ missfmt ' ' = "Missing" other = "Not_Missing"; value nmissfmt . = "Missing" other = "Not_Missing"; run; ods table onewayfreqs=out3; proc freq data=demonstrate; table _all_ / missing; format _numeric_ nmissfmt. _character_ $missfmt.; run; ods table onewayfreqs=out4; proc freq data=demonstrate order=freq; table _all_ / list nopercent nocum; run; proc sort data=out4; by table; run; data out5; set out4 (keep=table frequency f_:); table = substr(table,7,length(table)-6); by table; if first.table then count = 0; count + 1; if count le 5 then output; drop count; run; data out6 (drop=F_:); set out5; variable = cats(of F_:); run; proc transpose data=out6 out=out7 (drop=_name_) prefix=freq; by table; run; proc transpose data=out6 (keep=table variable) out=out8 (drop=_name_) prefix=val; by Table; var variable; run; data out9 (rename=(table=name)); retain table val1 val2 val3 val4 val5 freq1 freq2 freq3 freq4 freq5; merge out7 (in=a) out8 (in=b); by table; if a; run; data out10; length variable $32. variable_value $50.; set out3; Variable=scan(table, 2); Variable_Value=strip(trim(vvaluex(variable))); presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")"); keep variable variable_value frequency percent cum: presentation; label variable='Variable' variable_value='Variable Value'; run; proc sort data=out10; by variable; run; proc transpose data=out10 out=out11; by variable; id variable_value; var presentation; run; data out12; merge out2 (in=a) out11 (rename=(variable=name) drop=_NAME_ in=b); by NAME; if a; run; data out13; length name $300.; merge out12 (in=a) out9 (in=b); by name; if a; run;
My aim is to have the data structure as can be seen in dataset called out13.
Any suggestion how to simplify and rewrite the script so that it worked and provided me with desired output? Thanks
11-03-2017 04:00 AM
No that is just a code that demonstrates desired output, I cannot post the original data here. The original dataset has 100m+ rows.
11-02-2017 05:34 PM
Can you describe in narrative some of what you are attempting?
I have no clue what the purpose of your Val1 - Val5 variables might be.
Are you trying to find the five most common values for all variables in a dataset??
11-03-2017 03:55 AM
Yes, I want to find 5 most common values and their occurances as well as the basic description of the data set (i.e. location, name of table, all variables, etc.). Thanks
11-02-2017 06:40 PM
The error message mentions file work.copy_1.
This dataset name does not exist in the code you have posted.
Getting insufficient memory is usually the consequence of too many values to check
(maybe each value has frequency of 1) .
11-03-2017 03:58 AM
I removed all of the identifiers (i.e. unique values) so there should no longer be variables that make no sense to find 5 most common values.
11-03-2017 05:01 AM
Do you know if the memory is cleaned within each run of loop? I tried to clean workspace in each run of the loop but still it does not help and it crashes after processing ~ 1/2 of the data from table that has 22m+ rows and 105 cols. Is there anyway how to clean the memory but store information about the nested loops (i.e. in which run they were)?
11-03-2017 05:34 AM
Please post the full log, including the code (use options source source2 mprint - of the crashing step.
Is the input dataset, to that step, sorted ? if yes - by what key variables ?
11-03-2017 05:49 AM
The input table is not sorted. However, it returned reasonable error proving that there was left one variable that was nearly unique across lines which in fact probably caused the error. I removed it and now the script appears to be working. Do you have any suggestion how to simplify the posted code? It should be way more straightforward to get the structure of data as can ben seen in dataset out13?
11-03-2017 08:08 AM
I'm trying to summarize situation up to now:
1) You overcome the problem of INSUFFICIENT SPACE.
2) I quote from your post to @ballardw :
Yes, I want to find 5 most common values and their occurances as well as the basic description of the data set (i.e. location, name of table, all variables, etc.)
3) It is difficult to me to follow your code - what was the intention of each step.
- post a sample (few lines with most relevant columns) of the input dataset to begin from.
Is it the output of PROC FREQ that caused the crash ?
- Post your logic to create the desired report - one statement per step.
- Post the format of the desired report. Concentrate on the relevant columns you show in the sample.
Doing that, you may come yourself to the solution how to simplify your code,
otherwise, we are here to help you.
11-03-2017 09:37 AM
Ad 1) Yes, the problem with insufficient memory has been overcome.
Ad 2) My aim is to have basic description of the file, i.e. location of the table (libname), name of table (memname), list of variables and all other characteristics of given dataset (format, informat, length, etc.) + 5 most common values and the number of their occurances.
Ad 3) The format of the output is given by dataset out13:
I think it should be possible to get this format with out so many formatting steps (as you can see in the posted code).
I need to analyze all of the variables (5 most common values and their occurances) except these that are unique. The crash occured while processing unique/nearly unique (for e.g. cellphone numbers) values in given column, after removing them, it worked.
11-03-2017 10:46 AM
I would need to understand how the memory is allocated for SAS script execution. As I wrote within original post, the loop goes through one variable (country) that splits the huge table into bits. Afterwards, the proc freq are executed for each single subset. It worked (even for the greatest subset) after removing unique values when I ran it for a single country, however when I tried to run it whole (to go through all countries within given loop), it failed again reporting same error as posted originally.
Any suggestion how to overcome this problem? Thanks