DATA Step, Macro, Functions and more

Freq of all char variables from dataset where unique count is less than or equal to 50

Accepted Solution Solved
Reply
Contributor
Posts: 30
Accepted Solution

Freq of all char variables from dataset where unique count is less than or equal to 50

[ Edited ]

I have a dataset with 30MM observations and 100 variables. 70 variables are char and 30 are num. I need to do proc freq for all char variables where distict count of value would be no more than 50 (means where level of char variables would be less than or equal to 50).

 

I tried following approach:-

 

1) Proc freq with NLEVELS to get unique count of all char variable so can filter where count is <= 50 and use this list to get proc freq.

 

problem :- getting memory error so I think I can't use it.

 

here is code

 

ods select nlevels;

ods output nlevels=miss_vars(rename = (NLevels = UniqueCount NMissLevels = MissingCount NNonMissLevels = NonMissingUniCount )) ;

ods noresults;

proc freq data=&dsn. nlevels;

run;

ods results;

 

error:-

Not enough memory for all variables. Variable xxxx removed after 52886 levels.

 

2) using proc sort and proc freq :-

 

This gives me what I am looking for but this one takes almost 8 - 10 hrs to complete.

 

%macro freq_rept (ds=&dsn.,cutoff=50,titleyn=1);

proc contents data=&ds out=freqchk1 noprint;

run;

data _null_;

set freqchk1 (keep=name) end=last;

call symput('var'||trim(left(_n_)),trim(left(name)));

if last then call symput('count',trim(left(_n_)));

run;

%do i=1 %to &count;

proc sort data=&ds out=freqchk2 (keep=&&var&i) nodupkey;

by &&var&i;

run;

data _null_;

set freqchk2 end=last;

if last then do;

if _n_<=&cutoff then flag=1;

else flag=0;

call symput("flag&i",trim(left(flag)));

end;

run;

proc datasets library=work nolist;

delete freqchk2;

quit;

%end;

%let okflag=0;

%do i=1 %to &count;

%if &&flag&i=1 %then %let okflag=1;

%end;

%if &titleyn=1 %then %str(title "&ds"Smiley Wink;

%else %str(titleSmiley Wink;

%if &okflag=1 %then %do;

proc freq data=&ds;

table %do i=1 %to &count;

%if &&flag&i=1 %then %do;

&&var&i

%end;

%end;;

%end;

proc datasets library=work nolist;

delete freqchk1;

quit;

%mend freq_rept;

 

3) Using proc sql and distict(var_name) to get distinct value and use them to filter where counts have <= 50 to create var list.

 

This is also taking time and even some time give me memory error.

 

Is there any method that would easy to implement and  get me freq without taking so much time or without throwing me memory error. I can't change memory size and using base SAS (EG 5.1).

 

Is there any way to stop processing for the variable where distict count would go more than 50 in NLEVELS or proc sql distict.

 

Thanks,

 

 

 


Accepted Solutions
Solution
‎03-30-2016 11:11 PM
Trusted Advisor
Posts: 1,115

Re: Freq of all char variables from dataset where unique count is less than or equal to 50

Hello @nkm123,

 

I would support @Astounding's approach to rule out character variables which have >50 levels in the first n observations (n << 3.0E7).
If you suspect that the first n obs. are not representative, you could consider drawing a random sample of that size (using PROC SURVEYSELECT).

 

Or you could try the code below, which creates a macro variable (VARLIST) containing the names of the character variables with <=50 non-missing levels. You could use &VARLIST in the TABLES statement of PROC FREQ. Actually, it should even be possible to extend the code to perform the counts. (If you want to include missing values in the counts, you can initialize the arrays t&i with default values, see suggested code.)

 

I have tested the macro with the large (19.6 GB) HAVE dataset (see commented-out data step) with 30 million observations and 70 character variables (of length 10), 35 of which have <=50 levels: c36, c37, ..., c70. Run time (on my workstation): <5 min.

 

/* Create test data */

data have;
set sashelp.cars;
run;

/*
data have;
array c[70] $10;
do i=1 to 3e7;
  do j=1 to 70;
    c[j]=put(int(i/(100000+14200*j)),8.);
  end;
  output;
end;
drop i j;
run; /* 30 million obs., 70 variables, 19.6 GB * /
*/

%macro cntlvl(inlib=WORK, inds=HAVE, cutoff=50);

%local ncv i;

/* Determine number and lengths of character variables */

proc sql noprint;
select length into :len1-:len999
from dictionary.columns
where libname="%upcase(&inlib)" & memname="%upcase(&inds)" & type='char';
quit;

%let ncv=&sqlobs;

/* Determine variables with <=&cutoff distinct non-missing values */

data _null_;
set have end=last;
length varlist $%eval(&ncv*33-1);
array chv _character_;
%do i=1 %to &ncv;
  array t&i[%eval(&cutoff+1)] $&&len&i _temporary_; /* optional (lengths permitting): (%eval(&cutoff+1)*'#default#') */
%end;
array p[&ncv] _temporary_ (&ncv*1); /* first empty position in array t&i */

%do i=1 %to &ncv;
  if p[&i] then do;
    if chv[&i] not in t&i then do;
      t&i[p[&i]]=chv[&i];
      p[&i]+1;
      if p[&i]>%eval(&cutoff+1) then p[&i]=0;
    end;
  end;
%end;

if last then do;
  do i=1 to &ncv;
    if p[i] then varlist=catx(' ', varlist, vname(chv[i]));
  end;
  call symputx('varlist',varlist,'g');
end;
run;

%mend cntlvl;

%cntlvl;

%put &varlist;

View solution in original post


All Replies
Respected Advisor
Posts: 3,777

Re: Freq of all char variables from dataset where unique count is less than or equal to 50

Do you need to restrict PROC FREQ to _CHARACTER_ variables?

 

Do you know how to find out has much memory is allocated for your SAS session?

 

What is your OS?  When I submit to SAS/Grid I can ask for more memory.  -memsize=4g for example.

Contributor
Posts: 30

Re: Freq of all char variables from dataset where unique count is less than or equal to 50

Thanks for prompt reply. Here is what I have

 

1) Yes only for _CHAR_ variable .

 

2) Here is what I got when use proc option with group = Memory

 

Group=MEMORY

SORTSIZE=268435456

Size parameter for sort

SUMSIZE=0 Upper limit for data-dependent memory usage during summarization

MAXMEMQUERY=268435456

Maximum amount of memory returned when inquiring as to available space

LOADMEMSIZE=0 Suggested memory limit for loaded SAS executables

MEMSIZE=536870912 Specifies the limit on the total amount of memory to be used by the SAS System

REALMEMSIZE=0 Limit on the total amount of real memory to be used by the SAS System

 

3) It's unix OS.

 

Thanks,

 

Respected Advisor
Posts: 3,777

Re: Freq of all char variables from dataset where unique count is less than or equal to 50

Use 

 

tables _character_ / noprint;

 

and use -MEMSIZE option when you start SAS.

 

That should help you get further along.

 

You may have to more than one PROC FREQ for a range of character variables small enough number to meet memory restrictions.

Super User
Posts: 5,092

Re: Freq of all char variables from dataset where unique count is less than or equal to 50

You'll need to know a bit about your data to know if this is acceptable.

 

Run the PROC FREQ using obs=100000, to determine if there are more than 50 levels for each variable.  Then you can use all observations for the final PROC FREQ.

 

The idea is that if there are more than 50 levels, you will see that in the first 100,000 observations.  If your data might be structured differently, of course, this won't work.

Super User
Posts: 17,907

Re: Freq of all char variables from dataset where unique count is less than or equal to 50

SAS Enterprise Miner does a lotof this out of the box, if its an option. You can weight the cost benefit of programming your own versus purchasing the license. 

Solution
‎03-30-2016 11:11 PM
Trusted Advisor
Posts: 1,115

Re: Freq of all char variables from dataset where unique count is less than or equal to 50

Hello @nkm123,

 

I would support @Astounding's approach to rule out character variables which have >50 levels in the first n observations (n << 3.0E7).
If you suspect that the first n obs. are not representative, you could consider drawing a random sample of that size (using PROC SURVEYSELECT).

 

Or you could try the code below, which creates a macro variable (VARLIST) containing the names of the character variables with <=50 non-missing levels. You could use &VARLIST in the TABLES statement of PROC FREQ. Actually, it should even be possible to extend the code to perform the counts. (If you want to include missing values in the counts, you can initialize the arrays t&i with default values, see suggested code.)

 

I have tested the macro with the large (19.6 GB) HAVE dataset (see commented-out data step) with 30 million observations and 70 character variables (of length 10), 35 of which have <=50 levels: c36, c37, ..., c70. Run time (on my workstation): <5 min.

 

/* Create test data */

data have;
set sashelp.cars;
run;

/*
data have;
array c[70] $10;
do i=1 to 3e7;
  do j=1 to 70;
    c[j]=put(int(i/(100000+14200*j)),8.);
  end;
  output;
end;
drop i j;
run; /* 30 million obs., 70 variables, 19.6 GB * /
*/

%macro cntlvl(inlib=WORK, inds=HAVE, cutoff=50);

%local ncv i;

/* Determine number and lengths of character variables */

proc sql noprint;
select length into :len1-:len999
from dictionary.columns
where libname="%upcase(&inlib)" & memname="%upcase(&inds)" & type='char';
quit;

%let ncv=&sqlobs;

/* Determine variables with <=&cutoff distinct non-missing values */

data _null_;
set have end=last;
length varlist $%eval(&ncv*33-1);
array chv _character_;
%do i=1 %to &ncv;
  array t&i[%eval(&cutoff+1)] $&&len&i _temporary_; /* optional (lengths permitting): (%eval(&cutoff+1)*'#default#') */
%end;
array p[&ncv] _temporary_ (&ncv*1); /* first empty position in array t&i */

%do i=1 %to &ncv;
  if p[&i] then do;
    if chv[&i] not in t&i then do;
      t&i[p[&i]]=chv[&i];
      p[&i]+1;
      if p[&i]>%eval(&cutoff+1) then p[&i]=0;
    end;
  end;
%end;

if last then do;
  do i=1 to &ncv;
    if p[i] then varlist=catx(' ', varlist, vname(chv[i]));
  end;
  call symputx('varlist',varlist,'g');
end;
run;

%mend cntlvl;

%cntlvl;

%put &varlist;
Super User
Posts: 9,687

Re: Freq of all char variables from dataset where unique count is less than or equal to 50

If proc freq didn't work out. You could try SQL :



data have;
 set sashelp.class;
run;


data _null_;
 set sashelp.vcolumn(where=(libname='WORK' and memname='HAVE' and type='char')) end=last;
 if _n_=1 then call execute('proc sql;create table want as select ');
 call execute(cat('count(distinct ',name,') as ',name));
 if not last then call execute(',');
  else call execute('from have;quit;');
run;

Contributor
Posts: 30

Re: Freq of all char variables from dataset where unique count is less than or equal to 50

Thanks everyone for your help. Good learning.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 460 views
  • 0 likes
  • 6 in conversation