Proc freq insufficient memory error while processing huge table - Page 2

Shmuel · Posted 11-03-2017 10:46 AM

Can you sort the proc freq output by variable name and its frequency ?

Assuming YES:

- Select first 5 rows (maybe for some variables there are less than 5) - per variable name

and transpose it (may be done by retaining array value1-value5 freq1-freq5)

geting output one line paer variable.

- merge it with proc contents (or seql dictionary.columns) output

- create your report.

Uknown_user · Posted 11-03-2017 11:54 AM

I understand what you are saying but is it possible for you to post your code here that would return desired output for sample data I posted here? Assuming it can be in 3 single steps: sort the output of proc freq (select just 5 records), transpose it somehow and finally merge it with proc contents?

Reeza · Posted 11-03-2017 12:21 PM

What about ties?

data demonstrate;
set loop;
if mod(i,2) = 0 then _b = 'value1';
else _b = 'value2';
_x = floorz(abs(rand("NORMAL")*2))+1;
if _x > length(_b) then _x = length(_b)-1;
if _x = 0 then _x = 1;
_y = floorz(abs(rand("NORMAL")*2))+1;
if _y + _x > length(_b) then _y = 1;
_c = substr(_b,_y,_x);
d = cats(of _:);
run;

ods table onewayfreqs=temp;
proc freq data=demonstrate;
	table _all_;
run;

*Format output;
data want;
length variable $32. variable_value $50.;
set temp;
Variable=scan(table, 2);

Variable_Value=strip(trim(vvaluex(variable)));

keep variable variable_value frequency percent cum:;
label variable='Variable' 
	variable_value='Variable Value';
run;

proc sort data=want;
by variable descending frequency;
run;

data want2;
set want;
by variable;
if first.variable then count=0;
count+1;
if count<=5;
run;

Uknown_user · Posted 11-03-2017 12:44 PM

Thanks for your suggestion, the data needs to be transposed and afterwards merged with proc contents. It is still not clear to me how to get the information about completeness (i.e. total missing obs and total non-missing obs). Thanks

Reeza · Posted 11-03-2017 12:50 PM

@Uknown_user wrote:

Thanks for your suggestion, the data needs to be transposed and afterwards merged with proc contents.

That should be within your skill sets.

It is still not clear to me how to get the information about completeness (i.e. total missing obs and total non-missing obs). Thanks

You realize this is the first time you've mentioned this requirement?

There are macro's on here that illustrate how to calculate all missing/completeness report.

Have you tried the CHARACTERIZE DATA task in EG? It doesn't do a lot of this already? (Tasks>Describe>Characterize)

PROC FREQ can do the missing summary, see this example here:

https://gist.github.com/statgeek/2de1faf1644dc8160fe721056202f111

The remaining information you get from PROC CONTENTS or SASHELP.VCOLUMN. I think you have all the components now, you need to figure out how to incorporate/combine them all into what you want as a final output.

Good Luck.

Uknown_user · Posted 11-03-2017 01:16 PM

Thanks for your suggestions. I never tried build-in macros, let us give a try (already started execution of this one). I have adopted the solution posted on github within my script - thanks for providing me with the link. Is there any way how to call these build-in macros on datasets created on a go (within a loop for example)?

Shmuel · Posted 11-04-2017 04:48 PM

It seems that the code @Reeza posted is simmlar to the logic I prposed, so next code is based on @Reeza's code

with some changes:

1) I have added to the 1st step, creating the demostarate dataset some missing values.

2) I have added to the last step creation of the arrays - most 5 frequncies and ther values,

amount of missing values and of total values per variable.

data demonstrate;
do i=1 to 1000;
if mod(i,2) = 0 then _b = 'value1';
else _b = 'value2';
_x = floorz(abs(rand("NORMAL")*2))+1;
if _x > length(_b) then _x = length(_b)-1;
if _x = 0 then _x = 1;

if i in(10,30,100,400) then _x = .;   /* added some missing values */
_m = .;                               /* added new variable missing at all obs. */
_y = floorz(abs(rand("NORMAL")*2))+1;
if _y + _x > length(_b) then _y = 1;
_c = substr(_b,_y,_x);
d = cats(of _:);
output;
end;
run;

ods table onewayfreqs=temp;
proc freq data=demonstrate(drop=i); /* drop unique */
	table _all_ / missing nopercent nocum;
run;

data want;
length variable $32. variable_value $50.;
set temp;
Variable=upcase(strip(scan(table, 2)));

Variable_Value=strip(trim(vvaluex(variable)));

keep variable variable_value frequency;
label variable='Variable' 
	variable_value='Variable Value';
run;

proc sort data=want;
by variable descending frequency;
run;

data want2;
set want;
by variable;
retain val1-val5 frq1-frq5 misfrq totfreq;
array valx $ val1-val5;
array frqx frq1-frq5;

if first.variable then do;
   count=0; misfrq=0; totfreq=0;;
   do i=1 to 5;
      call missing(valx(i));
      call missing(frqx(i));
   end;
end;

totfreq = sum(of totfreq, frequency);
if upcase(variable) = '_X' then put totfreq=;

if variable_value in(' ' '.')  
then misfrq=frequency;

count+1;
if count le 5 then do;;
   valx(count) = variable_value;
   frqx(count) = frequency;
end;
if last.variable then output;
drop i count variable_value frequency;
run;

Uknown_user · Posted 11-04-2017 06:01 PM

Thank you for your help. I have adopted solution with build-in macro "characterize" and after few modifications, it worked (ran sucessfully for all countries). Your code works on the sample data, I will try it on the actual data and measure run times. Thanks again!

Rick_SAS · Posted 11-03-2017 12:43 PM

When you use PROC FREQ on a table, the default behavior is to print the results. Since your input data has 22 million rows, is it possible the problem is that the variable has many unique values and that PROC FREQ is trying to create an HTML file that contains 105 tables, some of which have millions of rows? Try using the NOPRINT option on PROC FREQ and use the OUT= option to create a data set of the results. Then you can use PROC CONTENTS to see how many rows are in the output data set.

Also, replace the _ALL_ keyword with a particular variable name that you know has only a few levels, such as SEX or GENDER.

Uknown_user · Posted 11-03-2017 01:03 PM

Thanks for your contribution to this post. Well yes, the default behaviour is to print a html file, however, if this exceeds limit, it simply pups up a window (when the script finishes execution) that asks if I wish to display the results or not, by clicking no, it displays nothing. Unfortunatelly, the html issue is not my case.

The error returned by SAS indicated that it stopped after hundreds of thousands records of given variable (originally not posted here) which provides one with clear information where the problem was. It cannot bear that many unique values for given variable, after exlusion of 4 variables (that were either unique or nearly unique) the script ran successfully. This is the example of returned error:

Do you think it will help if I break these subsets column-wise and loop through so it process just few variables at once?

Reeza · Posted 11-03-2017 01:07 PM

Different question:

Does it make sense to have the most frequent five values for variables with that many unique values? I suspect not...I suspect in those case a statistical summary is more effective, ie mean/median/range etc.

Uknown_user · Posted 11-03-2017 01:18 PM

I excluded these variables from the script thought.

Rick_SAS · Posted 11-03-2017 01:15 PM

The first part of your program applies formats to find the number of missing/nonmissing values. You seem to have used the tips in the article "Count the number of missing values for each variable." If you link to references that you used, that helps us understand what you are trying to accomplish.

When you use PROC FREQ to find the five most common values, you need to be a little careful. PROC FREQ is intended for categorical data analysis. The variables that you analyze should have discrete values, such as race, gender, and type of automobile. For continuous variables, you can use PROC MEANS or PROC UNIVARIATE to reveal basic descriptive statistics.

Uknown_user · Posted 11-03-2017 01:22 PM

I am trying to run it on consumer base table which consists of many countries. There are many different types of data ranging from telephone numbers to email addresses and gender or dates of birth etc. Each of these (except unique identifiers) is desired to be analyzed thought.

Reeza · Posted 11-03-2017 01:26 PM

@Uknown_user wrote:

I am trying to run it on consumer base table which consists of many countries. There are many different types of data ranging from telephone numbers to email addresses and gender or dates of birth etc. Each of these (except unique identifiers) is desired to be analyzed thought.

I can guarantee you that the 5 most frequent email addresses and date of birth are useless.

Break up your variables by type, categorize them as categorical/numerical and use the appropriate summaries for each.

You can still automate the analysis by the different types.

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Re: Proc freq insufficient memory error while processing huge table

Classroom Training Available!