BookmarkSubscribeRSS Feed
bikashten
Fluorite | Level 6

Hi all,

I am trying to find the variables or group of variables which cause the duplicates for id in the data during extraction. In my original data set, I have 100 variables and 500 duplicates for id. Here is a sample data set:

data test;
input id x1 x2 x3 x4;
datalines;
1 0 1 3 4
1 1 1 3 4
2 2 4 6 8
2 2 6 6 8 
3 3 3 3 3 
3 3 3 3 3 

;

I want this data set: because id 1 has only discrepancy for x1 and id 2 has discrepancy for x2 so that would like to get table like this. 

idx1x2
101
111
224
226
333
333

 

Thank you for your help!!!

 

Bikash

3 REPLIES 3
kulbshar
Calcite | Level 5

You may to create a macro and pass variable to it to find duplicates. I have created a macro to do the same on your dummy table & it's working.

 

data test;
input id x1 x2 x3 x4;
datalines;
1 0 1 3 4
1 1 1 2 3
2 2 4 6 8
2 2 6 5 5
3 3 3 7 3
3 3 3 3 7

;
run;


proc contents
data = test(drop=id)
noprint
out = data_info
(keep = name varnum);
run;

 

proc sql;
select name into: name1-
from data_info;
quit;

%put &name1 &name2 &name3 &name4;


%macro find_dup_var(tab=,key=,var=);
proc sql noprint;
create table tab&var. as select count(distinct &var.) as &var._cnt from &tab group by &key;
quit;

proc sql;
select "&var." as duplicate_variable from tab&var. having min(&var._cnt) =1;
quit;

%mend find_dup_var;

%macro run1();
%do i=1 %to 4;
%find_dup_var(tab=test,key=id,var=&&name&i.)
%end;

%mend;

%run1;

Astounding
PROC Star

Here's an approach for numeric variables.  It can be expanded to include character variables as well if that is important.

 

data _null_;
   set have (drop=id);
   array nums {*} _numeric_;
   call symputx('n_nums', dim(nums));
   stop;
run;

data _null_;
   if 5=4 then set have (drop=id);
   array nums {&n_nums} _numeric_;
   array lastv {&n_nums} _temporary_;
   array keepyn {&n_nums} $ 1 _temporary_;
   set have end=done;
   by id;
   do _n_=1 to &n_nums;
      if first.id=0 and lastv{_n_} ne nums{_n_} then keepyn{_n_} = 'Y';
      lastv{_n_} = nums{_n_};
   end;
   length keeplist $ 20000;
   if done;
   do _n_=1 to &n_nums;
      if keepyn{_n_} = 'Y' then keeplist = catx(' ', keeplist, vname(nums{_n_}) ) ;
   end;
   call symputx('keeplist', keeplist);
run;

data want;
   set have (keep=id &keeplist);
run;

The code is untested, so you will have to try it.  That's probably a good idea before trying to expand it to include character variables as well.  

 

Once it is debugged, the major advantage is that it only takes 2 passes through the data to handle everything that needs to be done.

Ksharp
Super User
data test;
input id x1 x2 x3 x4;
datalines;
1 0 1 3 4
1 1 1 3 4
2 2 4 6 8
2 2 6 6 8 
3 3 3 3 3 
3 3 3 3 3 
;
run;
proc sql;
create table temp as
 select id,count(distinct x1) as x1, count(distinct x2) as x2,
 count(distinct x3) as x3,count(distinct x4) as x4
  from test 
   group by id;
quit;

proc transpose data=temp out=temp1;
by id;
run;
proc sql noprint;
select distinct _name_ into : names separated by ','
from temp1
 where col1>1;

create table want as
 select id,&names from test;
quit;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 434 views
  • 0 likes
  • 4 in conversation