I have a very large program with approximately 40 variables. For the sake of argument, let's call them var1-var40.
I need to know how many duplicates there are for var1, var1 and var2, var1 var2 and var3, var1 var 2 var3 and var4, ........., var1 var2 var3 ...... var 40.
Is there a way to create this array without having to manually type it, and then recursively run it through a proc sort or proc SQL command so that I can see how many duplicates there are side by side for each of the combinations above.
Currently, I am running 40 proc SQL statements, following by 40 proc freq commands on the individual outputs and manually checking them against each other.
I feel there must be a way to incorporate an array with each variable combination into a do loop that automatically generates and stores the number of detected duplicates.
Bonus points if the above code could show me which observations are dropping at each point as the combinations increase in variable size.
Thanks in advance,
Zach
data have;
set sashelp.heart;
run;
data _null_;
set sashelp.vcolumn(keep=libname memname name where=(libname='WORK' and memname='HAVE'));
length list_vars $ 4000;
retain list_vars;
list_vars=catx(' ',list_vars,name); putlog list_vars= ;
call execute(cat('proc sort data=have out=dummy dupout=_dup_',_n_,' nodupkey; by ',list_vars,';run;'));
call execute(catt('data _dup_',_n_,'; set _dup_',_n_,';length from_vars $ 2000; from_vars="',list_vars,'";run;'));
run;
data all ;
set _dup_:;
run;
proc sql;
create table want as
select from_vars,count(*) as n_duplicate
from all
group by from_vars;
quit;
Yeah, I'm not interested in all possible combinations. Just the 40 combinations that describe that pattern.
OK, moving in that direction ... the form of the program you need isn't 100% clear. Here's something similar that will probably give you some ideas about how to proceed. Feel free to come back and ask questions for more detail.
Macro language can get you a set of tables. It types out the statements for you, instead of making you do it. For example, here are 100 PROC FREQ tables:
proc freq data=have noprint;
%macro tables40;
%local i j;
%do i=1 %to 40;
tables var1
%do j=2 %to &i;
* var&j
%end;
/ out=freq&i (drop=percent where=(count > 1));
%end;
%mend tables40;
%tables40
run;
It's untested code, but looks about right. There could be memory problems with 40 variables going into the same table, so we may have to cross bridges as we come to them.
Not clear what you expect as result or what the existing code actually does, so please post data in usable form, so that we see what you have. Also add the expected result for that data and the code (just one proc sql + freq) you have.
data have;
set sashelp.heart;
run;
data _null_;
set sashelp.vcolumn(keep=libname memname name where=(libname='WORK' and memname='HAVE'));
length list_vars $ 4000;
retain list_vars;
list_vars=catx(' ',list_vars,name); putlog list_vars= ;
call execute(cat('proc sort data=have out=dummy dupout=_dup_',_n_,' nodupkey; by ',list_vars,';run;'));
call execute(catt('data _dup_',_n_,'; set _dup_',_n_,';length from_vars $ 2000; from_vars="',list_vars,'";run;'));
run;
data all ;
set _dup_:;
run;
proc sql;
create table want as
select from_vars,count(*) as n_duplicate
from all
group by from_vars;
quit;
OK, your objective is becoming clearer. Let's start with sorting the data set once instead of 40 times:
proc sort data=have;
by var1-var40;
run;
That step lets you use any of these BY statements later:
by var1;
by var11 var2;
by var1 var2 var3;
How you proceed after sorting ... well I'll show the simpler variation that outputs all the duplicates. Counting the duplicates is not much more difficult but let's start with subsetting:
data dup_01 dup_02 dup03;
set have;
by var1-var3;
if first.var1=0 or last.var1=0 then output dup_01;
if first.var2=0 or last.var2=0 then output dup_02;
if first.var3=0 or last.var3=0 then output dup_03;
run;
And you get all this processing your data set twice, not 80 times.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.