Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- Programming
- /
- Re: Freq of all char variables from dataset where unique count is less...

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 03-29-2016 03:22 PM
(2889 views)

I have a dataset with 30MM observations and 100 variables. 70 variables are char and 30 are num. I need to do proc freq for all char variables where distict count of value would be no more than 50 (means where level of char variables would be less than or equal to 50).

I tried following approach:-

1) Proc freq with NLEVELS to get unique count of all char variable so can filter where count is <= 50 and use this list to get proc freq.

problem :- getting memory error so I think I can't use it.

here is code

ods select nlevels;

ods output nlevels=miss_vars(rename = (NLevels = UniqueCount NMissLevels = MissingCount NNonMissLevels = NonMissingUniCount )) ;

ods noresults;

proc freq data=&dsn. nlevels;

run;

ods results;

error:-

Not enough memory for all variables. Variable xxxx removed after 52886 levels.

2) using proc sort and proc freq :-

This gives me what I am looking for but this one takes almost 8 - 10 hrs to complete.

%macro freq_rept (ds=&dsn.,cutoff=50,titleyn=1);

proc contents data=&ds out=freqchk1 noprint;

run;

data _null_;

set freqchk1 (keep=name) end=last;

call symput('var'||trim(left(_n_)),trim(left(name)));

if last then call symput('count',trim(left(_n_)));

run;

%do i=1 %to &count;

proc sort data=&ds out=freqchk2 (keep=&&var&i) nodupkey;

by &&var&i;

run;

data _null_;

set freqchk2 end=last;

if last then do;

if _n_<=&cutoff then flag=1;

else flag=0;

call symput("flag&i",trim(left(flag)));

end;

run;

proc datasets library=work nolist;

delete freqchk2;

quit;

%end;

%let okflag=0;

%do i=1 %to &count;

%if &&flag&i=1 %then %let okflag=1;

%end;

%if &titleyn=1 %then %str(title "&ds";);

%else %str(title;);

%if &okflag=1 %then %do;

proc freq data=&ds;

table %do i=1 %to &count;

%if &&flag&i=1 %then %do;

&&var&i

%end;

%end;;

%end;

proc datasets library=work nolist;

delete freqchk1;

quit;

%mend freq_rept;

3) Using proc sql and distict(var_name) to get distinct value and use them to filter where counts have <= 50 to create var list.

This is also taking time and even some time give me memory error.

Is there any method that would easy to implement and get me freq without taking so much time or without throwing me memory error. I can't change memory size and using base SAS (EG 5.1).

Is there any way to stop processing for the variable where distict count would go more than 50 in NLEVELS or proc sql distict.

Thanks,

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hello @nkm123,

I would support @Astounding's approach to rule out character variables which have >50 levels in the first n observations (n << 3.0E7).

If you suspect that the *first* n obs. are not representative, you could consider drawing a random sample of that size (using PROC SURVEYSELECT).

Or you could try the code below, which creates a macro variable (VARLIST) containing the names of the character variables with <=50 non-missing levels. You could use &VARLIST in the TABLES statement of PROC FREQ. Actually, it should even be possible to extend the code to perform the counts. (If you want to include missing values in the counts, you can initialize the arrays t&i with default values, see suggested code.)

I have tested the macro with the large (19.6 GB) HAVE dataset (see commented-out data step) with 30 million observations and 70 character variables (of length 10), 35 of which have <=50 levels: c36, c37, ..., c70. Run time (on my workstation): **<5 min**.

```
/* Create test data */
data have;
set sashelp.cars;
run;
/*
data have;
array c[70] $10;
do i=1 to 3e7;
do j=1 to 70;
c[j]=put(int(i/(100000+14200*j)),8.);
end;
output;
end;
drop i j;
run; /* 30 million obs., 70 variables, 19.6 GB * /
*/
%macro cntlvl(inlib=WORK, inds=HAVE, cutoff=50);
%local ncv i;
/* Determine number and lengths of character variables */
proc sql noprint;
select length into :len1-:len999
from dictionary.columns
where libname="%upcase(&inlib)" & memname="%upcase(&inds)" & type='char';
quit;
%let ncv=&sqlobs;
/* Determine variables with <=&cutoff distinct non-missing values */
data _null_;
set have end=last;
length varlist $%eval(&ncv*33-1);
array chv _character_;
%do i=1 %to &ncv;
array t&i[%eval(&cutoff+1)] $&&len&i _temporary_; /* optional (lengths permitting): (%eval(&cutoff+1)*'#default#') */
%end;
array p[&ncv] _temporary_ (&ncv*1); /* first empty position in array t&i */
%do i=1 %to &ncv;
if p[&i] then do;
if chv[&i] not in t&i then do;
t&i[p[&i]]=chv[&i];
p[&i]+1;
if p[&i]>%eval(&cutoff+1) then p[&i]=0;
end;
end;
%end;
if last then do;
do i=1 to &ncv;
if p[i] then varlist=catx(' ', varlist, vname(chv[i]));
end;
call symputx('varlist',varlist,'g');
end;
run;
%mend cntlvl;
%cntlvl;
%put &varlist;
```

8 REPLIES 8

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Do you need to restrict PROC FREQ to _CHARACTER_ variables?

Do you know how to find out has much memory is allocated for your SAS session?

What is your OS? When I submit to SAS/Grid I can ask for more memory. -memsize=4g for example.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thanks for prompt reply. Here is what I have

1) Yes only for _CHAR_ variable .

2) Here is what I got when use proc option with group = Memory

Group=MEMORY

SORTSIZE=268435456

Size parameter for sort

SUMSIZE=0 Upper limit for data-dependent memory usage during summarization

MAXMEMQUERY=268435456

Maximum amount of memory returned when inquiring as to available space

LOADMEMSIZE=0 Suggested memory limit for loaded SAS executables

MEMSIZE=536870912 Specifies the limit on the total amount of memory to be used by the SAS System

REALMEMSIZE=0 Limit on the total amount of real memory to be used by the SAS System

3) It's unix OS.

Thanks,

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Use

tables _character_ / noprint;

and use -MEMSIZE option when you start SAS.

That should help you get further along.

You may have to more than one PROC FREQ for a range of character variables small enough number to meet memory restrictions.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

You'll need to know a bit about your data to know if this is acceptable.

Run the PROC FREQ using obs=100000, to determine if there are more than 50 levels for each variable. Then you can use all observations for the final PROC FREQ.

The idea is that if there are more than 50 levels, you will see that in the first 100,000 observations. If your data might be structured differently, of course, this won't work.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hello @nkm123,

I would support @Astounding's approach to rule out character variables which have >50 levels in the first n observations (n << 3.0E7).

If you suspect that the *first* n obs. are not representative, you could consider drawing a random sample of that size (using PROC SURVEYSELECT).

Or you could try the code below, which creates a macro variable (VARLIST) containing the names of the character variables with <=50 non-missing levels. You could use &VARLIST in the TABLES statement of PROC FREQ. Actually, it should even be possible to extend the code to perform the counts. (If you want to include missing values in the counts, you can initialize the arrays t&i with default values, see suggested code.)

I have tested the macro with the large (19.6 GB) HAVE dataset (see commented-out data step) with 30 million observations and 70 character variables (of length 10), 35 of which have <=50 levels: c36, c37, ..., c70. Run time (on my workstation): **<5 min**.

```
/* Create test data */
data have;
set sashelp.cars;
run;
/*
data have;
array c[70] $10;
do i=1 to 3e7;
do j=1 to 70;
c[j]=put(int(i/(100000+14200*j)),8.);
end;
output;
end;
drop i j;
run; /* 30 million obs., 70 variables, 19.6 GB * /
*/
%macro cntlvl(inlib=WORK, inds=HAVE, cutoff=50);
%local ncv i;
/* Determine number and lengths of character variables */
proc sql noprint;
select length into :len1-:len999
from dictionary.columns
where libname="%upcase(&inlib)" & memname="%upcase(&inds)" & type='char';
quit;
%let ncv=&sqlobs;
/* Determine variables with <=&cutoff distinct non-missing values */
data _null_;
set have end=last;
length varlist $%eval(&ncv*33-1);
array chv _character_;
%do i=1 %to &ncv;
array t&i[%eval(&cutoff+1)] $&&len&i _temporary_; /* optional (lengths permitting): (%eval(&cutoff+1)*'#default#') */
%end;
array p[&ncv] _temporary_ (&ncv*1); /* first empty position in array t&i */
%do i=1 %to &ncv;
if p[&i] then do;
if chv[&i] not in t&i then do;
t&i[p[&i]]=chv[&i];
p[&i]+1;
if p[&i]>%eval(&cutoff+1) then p[&i]=0;
end;
end;
%end;
if last then do;
do i=1 to &ncv;
if p[i] then varlist=catx(' ', varlist, vname(chv[i]));
end;
call symputx('varlist',varlist,'g');
end;
run;
%mend cntlvl;
%cntlvl;
%put &varlist;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thanks everyone for your help. Good learning.

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. **Registration is now open through August 30th**. Visit the SAS Hackathon homepage.

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.