I am working with many large datasets. Each dataset has millions of observations, and they range in size from 1GB to more than 100GB. I need generate for each variable the number of levels, missing levels, non-missing levels, number of observations, and the ratio of levels to observations. I am using the following code. However, SAS frequently reports insufficient memory and stops processing. I would like to learn a more efficient approach to producing the variable descriptions without running out of memory. Any suggestions would be appreciated.
proc freq nlevels data= mydata ;
ods output nlevels=nlevels;
tables _all_ / noprint ;
run;
data want ;
if 0 then set mydata (drop=_all_) nobs=nobs ;
set nlevels;
total=nobs;
unique_ratio = nlevels/total ;
run;
proc print;
run;
1. You could increase the memory allocated to SAS.
2. You could reduce the number of variables analysed in one go, for example
tables _NUMERIC_
then
tables _CHAR_
instead of
tables _ALL_
3. You could remove obviously high-cardinality variables. For example BANK_ACCT_BALANCE will obviously have a very high NLEVELS whose calculation brings no value.
4. proc sql allows you to calculate number of levels, missing levels, non-missing levels, number of observations, and the ratio of levels to observations in one step. You need to explicitly code the variable names though, and the memory issue remains. But at least you only have one pass per variable.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.