BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ZachLandone
Calcite | Level 5

I have a very large program with approximately 40 variables. For the sake of argument, let's call them var1-var40.

 

I need to know how many duplicates there are for var1, var1 and var2, var1 var2 and var3, var1 var 2 var3 and var4, ........., var1 var2 var3 ...... var 40.

 

Is there a way to create this array without having to manually type it, and then recursively run it through a proc sort or proc SQL command so that I can see how many duplicates there are side by side for each of the combinations above.

 

Currently, I am running 40 proc SQL statements, following by 40 proc freq commands on the individual outputs and manually checking them against each other.

 

I feel there must be a way to incorporate an array with each variable combination into a do loop that automatically generates and stores the number of detected duplicates.

 

Bonus points if the above code could show me which observations are dropping at each point as the combinations increase in variable size.

 

Thanks in advance,

Zach

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
data have;
 set sashelp.heart;
run;




data _null_;
 set sashelp.vcolumn(keep=libname memname name where=(libname='WORK' and memname='HAVE'));
 length list_vars $ 4000;
 retain list_vars;
 list_vars=catx(' ',list_vars,name); putlog list_vars= ;
 call execute(cat('proc sort data=have out=dummy dupout=_dup_',_n_,' nodupkey; by ',list_vars,';run;'));
 call execute(catt('data _dup_',_n_,'; set _dup_',_n_,';length from_vars $ 2000; from_vars="',list_vars,'";run;'));
run;
data all ;
 set _dup_:;
run;
proc sql;
create table want as
select from_vars,count(*) as n_duplicate
 from all
  group by from_vars;
quit;

View solution in original post

7 REPLIES 7
Astounding
PROC Star
What about other combinations such as var2 and var4, or var5 and var10 and var15? You may not care about those but if you do, you may not live long enough to inspect all the output. There would be over a trillion combinations.
ZachLandone
Calcite | Level 5

Yeah, I'm not interested in all possible combinations. Just the 40 combinations that describe that pattern.

Astounding
PROC Star

OK, moving in that direction ... the form of the program you need isn't 100% clear.  Here's something similar that will probably give you some ideas about how to proceed.  Feel free to come back and ask questions for more detail.

 

Macro language can get you a set of tables.  It types out the statements for you, instead of making you do it.  For example, here are 100 PROC FREQ tables:

proc freq data=have noprint;
 
%macro tables40;
   %local i j;
   %do i=1 %to 40;
      tables var1
      %do j=2 %to &i;
         * var&j
      %end;
      / out=freq&i (drop=percent where=(count > 1));
   %end;
%mend tables40;

%tables40
run;

It's untested code, but looks about right.  There could be memory problems with 40 variables going into the same table, so we may have to cross bridges as we come to them.

andreas_lds
Jade | Level 19

Not clear what you expect as result or what the existing code actually does, so please post data in usable form, so that we see what you have. Also add the expected result for that data and the code (just one proc sql + freq) you have.

Ksharp
Super User
data have;
 set sashelp.heart;
run;




data _null_;
 set sashelp.vcolumn(keep=libname memname name where=(libname='WORK' and memname='HAVE'));
 length list_vars $ 4000;
 retain list_vars;
 list_vars=catx(' ',list_vars,name); putlog list_vars= ;
 call execute(cat('proc sort data=have out=dummy dupout=_dup_',_n_,' nodupkey; by ',list_vars,';run;'));
 call execute(catt('data _dup_',_n_,'; set _dup_',_n_,';length from_vars $ 2000; from_vars="',list_vars,'";run;'));
run;
data all ;
 set _dup_:;
run;
proc sql;
create table want as
select from_vars,count(*) as n_duplicate
 from all
  group by from_vars;
quit;
ZachLandone
Calcite | Level 5
Works like a charm! Thanks
Astounding
PROC Star

OK, your objective is becoming clearer.  Let's start with sorting the data set once instead of 40 times:

 

proc sort data=have;
   by var1-var40;
run;

That step lets you use any of these BY statements later:

by var1;

by var11 var2;

by var1 var2 var3;

How you proceed after sorting ... well I'll show the simpler variation that outputs all the duplicates.  Counting the duplicates is not much more difficult but let's start with subsetting:

 

data dup_01 dup_02 dup03;
   set have;
   by var1-var3;
   if first.var1=0 or last.var1=0 then output dup_01;
   if first.var2=0 or last.var2=0 then output dup_02;
   if first.var3=0 or last.var3=0 then output dup_03;
run;

And you get all this processing your data set twice, not 80 times.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1330 views
  • 4 likes
  • 4 in conversation