BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
nkm123
Calcite | Level 5

I have a dataset with 30MM observations and 100 variables. 70 variables are char and 30 are num. I need to do proc freq for all char variables where distict count of value would be no more than 50 (means where level of char variables would be less than or equal to 50).

 

I tried following approach:-

 

1) Proc freq with NLEVELS to get unique count of all char variable so can filter where count is <= 50 and use this list to get proc freq.

 

problem :- getting memory error so I think I can't use it.

 

here is code

 

ods select nlevels;

ods output nlevels=miss_vars(rename = (NLevels = UniqueCount NMissLevels = MissingCount NNonMissLevels = NonMissingUniCount )) ;

ods noresults;

proc freq data=&dsn. nlevels;

run;

ods results;

 

error:-

Not enough memory for all variables. Variable xxxx removed after 52886 levels.

 

2) using proc sort and proc freq :-

 

This gives me what I am looking for but this one takes almost 8 - 10 hrs to complete.

 

%macro freq_rept (ds=&dsn.,cutoff=50,titleyn=1);

proc contents data=&ds out=freqchk1 noprint;

run;

data _null_;

set freqchk1 (keep=name) end=last;

call symput('var'||trim(left(_n_)),trim(left(name)));

if last then call symput('count',trim(left(_n_)));

run;

%do i=1 %to &count;

proc sort data=&ds out=freqchk2 (keep=&&var&i) nodupkey;

by &&var&i;

run;

data _null_;

set freqchk2 end=last;

if last then do;

if _n_<=&cutoff then flag=1;

else flag=0;

call symput("flag&i",trim(left(flag)));

end;

run;

proc datasets library=work nolist;

delete freqchk2;

quit;

%end;

%let okflag=0;

%do i=1 %to &count;

%if &&flag&i=1 %then %let okflag=1;

%end;

%if &titleyn=1 %then %str(title "&ds";);

%else %str(title;);

%if &okflag=1 %then %do;

proc freq data=&ds;

table %do i=1 %to &count;

%if &&flag&i=1 %then %do;

&&var&i

%end;

%end;;

%end;

proc datasets library=work nolist;

delete freqchk1;

quit;

%mend freq_rept;

 

3) Using proc sql and distict(var_name) to get distinct value and use them to filter where counts have <= 50 to create var list.

 

This is also taking time and even some time give me memory error.

 

Is there any method that would easy to implement and  get me freq without taking so much time or without throwing me memory error. I can't change memory size and using base SAS (EG 5.1).

 

Is there any way to stop processing for the variable where distict count would go more than 50 in NLEVELS or proc sql distict.

 

Thanks,

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @nkm123,

 

I would support @Astounding's approach to rule out character variables which have >50 levels in the first n observations (n << 3.0E7).
If you suspect that the first n obs. are not representative, you could consider drawing a random sample of that size (using PROC SURVEYSELECT).

 

Or you could try the code below, which creates a macro variable (VARLIST) containing the names of the character variables with <=50 non-missing levels. You could use &VARLIST in the TABLES statement of PROC FREQ. Actually, it should even be possible to extend the code to perform the counts. (If you want to include missing values in the counts, you can initialize the arrays t&i with default values, see suggested code.)

 

I have tested the macro with the large (19.6 GB) HAVE dataset (see commented-out data step) with 30 million observations and 70 character variables (of length 10), 35 of which have <=50 levels: c36, c37, ..., c70. Run time (on my workstation): <5 min.

 

/* Create test data */

data have;
set sashelp.cars;
run;

/*
data have;
array c[70] $10;
do i=1 to 3e7;
  do j=1 to 70;
    c[j]=put(int(i/(100000+14200*j)),8.);
  end;
  output;
end;
drop i j;
run; /* 30 million obs., 70 variables, 19.6 GB * /
*/

%macro cntlvl(inlib=WORK, inds=HAVE, cutoff=50);

%local ncv i;

/* Determine number and lengths of character variables */

proc sql noprint;
select length into :len1-:len999
from dictionary.columns
where libname="%upcase(&inlib)" & memname="%upcase(&inds)" & type='char';
quit;

%let ncv=&sqlobs;

/* Determine variables with <=&cutoff distinct non-missing values */

data _null_;
set have end=last;
length varlist $%eval(&ncv*33-1);
array chv _character_;
%do i=1 %to &ncv;
  array t&i[%eval(&cutoff+1)] $&&len&i _temporary_; /* optional (lengths permitting): (%eval(&cutoff+1)*'#default#') */
%end;
array p[&ncv] _temporary_ (&ncv*1); /* first empty position in array t&i */

%do i=1 %to &ncv;
  if p[&i] then do;
    if chv[&i] not in t&i then do;
      t&i[p[&i]]=chv[&i];
      p[&i]+1;
      if p[&i]>%eval(&cutoff+1) then p[&i]=0;
    end;
  end;
%end;

if last then do;
  do i=1 to &ncv;
    if p[i] then varlist=catx(' ', varlist, vname(chv[i]));
  end;
  call symputx('varlist',varlist,'g');
end;
run;

%mend cntlvl;

%cntlvl;

%put &varlist;

View solution in original post

8 REPLIES 8
data_null__
Jade | Level 19

Do you need to restrict PROC FREQ to _CHARACTER_ variables?

 

Do you know how to find out has much memory is allocated for your SAS session?

 

What is your OS?  When I submit to SAS/Grid I can ask for more memory.  -memsize=4g for example.

nkm123
Calcite | Level 5

Thanks for prompt reply. Here is what I have

 

1) Yes only for _CHAR_ variable .

 

2) Here is what I got when use proc option with group = Memory

 

Group=MEMORY

SORTSIZE=268435456

Size parameter for sort

SUMSIZE=0 Upper limit for data-dependent memory usage during summarization

MAXMEMQUERY=268435456

Maximum amount of memory returned when inquiring as to available space

LOADMEMSIZE=0 Suggested memory limit for loaded SAS executables

MEMSIZE=536870912 Specifies the limit on the total amount of memory to be used by the SAS System

REALMEMSIZE=0 Limit on the total amount of real memory to be used by the SAS System

 

3) It's unix OS.

 

Thanks,

 

data_null__
Jade | Level 19

Use 

 

tables _character_ / noprint;

 

and use -MEMSIZE option when you start SAS.

 

That should help you get further along.

 

You may have to more than one PROC FREQ for a range of character variables small enough number to meet memory restrictions.

Astounding
PROC Star

You'll need to know a bit about your data to know if this is acceptable.

 

Run the PROC FREQ using obs=100000, to determine if there are more than 50 levels for each variable.  Then you can use all observations for the final PROC FREQ.

 

The idea is that if there are more than 50 levels, you will see that in the first 100,000 observations.  If your data might be structured differently, of course, this won't work.

Reeza
Super User

SAS Enterprise Miner does a lotof this out of the box, if its an option. You can weight the cost benefit of programming your own versus purchasing the license. 

FreelanceReinh
Jade | Level 19

Hello @nkm123,

 

I would support @Astounding's approach to rule out character variables which have >50 levels in the first n observations (n << 3.0E7).
If you suspect that the first n obs. are not representative, you could consider drawing a random sample of that size (using PROC SURVEYSELECT).

 

Or you could try the code below, which creates a macro variable (VARLIST) containing the names of the character variables with <=50 non-missing levels. You could use &VARLIST in the TABLES statement of PROC FREQ. Actually, it should even be possible to extend the code to perform the counts. (If you want to include missing values in the counts, you can initialize the arrays t&i with default values, see suggested code.)

 

I have tested the macro with the large (19.6 GB) HAVE dataset (see commented-out data step) with 30 million observations and 70 character variables (of length 10), 35 of which have <=50 levels: c36, c37, ..., c70. Run time (on my workstation): <5 min.

 

/* Create test data */

data have;
set sashelp.cars;
run;

/*
data have;
array c[70] $10;
do i=1 to 3e7;
  do j=1 to 70;
    c[j]=put(int(i/(100000+14200*j)),8.);
  end;
  output;
end;
drop i j;
run; /* 30 million obs., 70 variables, 19.6 GB * /
*/

%macro cntlvl(inlib=WORK, inds=HAVE, cutoff=50);

%local ncv i;

/* Determine number and lengths of character variables */

proc sql noprint;
select length into :len1-:len999
from dictionary.columns
where libname="%upcase(&inlib)" & memname="%upcase(&inds)" & type='char';
quit;

%let ncv=&sqlobs;

/* Determine variables with <=&cutoff distinct non-missing values */

data _null_;
set have end=last;
length varlist $%eval(&ncv*33-1);
array chv _character_;
%do i=1 %to &ncv;
  array t&i[%eval(&cutoff+1)] $&&len&i _temporary_; /* optional (lengths permitting): (%eval(&cutoff+1)*'#default#') */
%end;
array p[&ncv] _temporary_ (&ncv*1); /* first empty position in array t&i */

%do i=1 %to &ncv;
  if p[&i] then do;
    if chv[&i] not in t&i then do;
      t&i[p[&i]]=chv[&i];
      p[&i]+1;
      if p[&i]>%eval(&cutoff+1) then p[&i]=0;
    end;
  end;
%end;

if last then do;
  do i=1 to &ncv;
    if p[i] then varlist=catx(' ', varlist, vname(chv[i]));
  end;
  call symputx('varlist',varlist,'g');
end;
run;

%mend cntlvl;

%cntlvl;

%put &varlist;
Ksharp
Super User
If proc freq didn't work out. You could try SQL :



data have;
 set sashelp.class;
run;


data _null_;
 set sashelp.vcolumn(where=(libname='WORK' and memname='HAVE' and type='char')) end=last;
 if _n_=1 then call execute('proc sql;create table want as select ');
 call execute(cat('count(distinct ',name,') as ',name));
 if not last then call execute(',');
  else call execute('from have;quit;');
run;

nkm123
Calcite | Level 5

Thanks everyone for your help. Good learning.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 2439 views
  • 1 like
  • 6 in conversation