Hi all,
I am trying to find the variables or group of variables which cause the duplicates for id in the data during extraction. In my original data set, I have 100 variables and 500 duplicates for id. Here is a sample data set:
data test;
input id x1 x2 x3 x4;
datalines;
1 0 1 3 4
1 1 1 3 4
2 2 4 6 8
2 2 6 6 8
3 3 3 3 3
3 3 3 3 3
;I want this data set: because id 1 has only discrepancy for x1 and id 2 has discrepancy for x2 so that would like to get table like this.
| id | x1 | x2 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
| 2 | 2 | 4 |
| 2 | 2 | 6 |
| 3 | 3 | 3 |
| 3 | 3 | 3 |
Thank you for your help!!!
Bikash
You may to create a macro and pass variable to it to find duplicates. I have created a macro to do the same on your dummy table & it's working.
data test;
input id x1 x2 x3 x4;
datalines;
1 0 1 3 4
1 1 1 2 3
2 2 4 6 8
2 2 6 5 5
3 3 3 7 3
3 3 3 3 7
;
run;
proc contents
data = test(drop=id)
noprint
out = data_info
(keep = name varnum);
run;
proc sql;
select name into: name1-
from data_info;
quit;
%put &name1 &name2 &name3 &name4;
%macro find_dup_var(tab=,key=,var=);
proc sql noprint;
create table tab&var. as select count(distinct &var.) as &var._cnt from &tab group by &key;
quit;
proc sql;
select "&var." as duplicate_variable from tab&var. having min(&var._cnt) =1;
quit;
%mend find_dup_var;
%macro run1();
%do i=1 %to 4;
%find_dup_var(tab=test,key=id,var=&&name&i.)
%end;
%mend;
%run1;
Here's an approach for numeric variables. It can be expanded to include character variables as well if that is important.
data _null_;
set have (drop=id);
array nums {*} _numeric_;
call symputx('n_nums', dim(nums));
stop;
run;
data _null_;
if 5=4 then set have (drop=id);
array nums {&n_nums} _numeric_;
array lastv {&n_nums} _temporary_;
array keepyn {&n_nums} $ 1 _temporary_;
set have end=done;
by id;
do _n_=1 to &n_nums;
if first.id=0 and lastv{_n_} ne nums{_n_} then keepyn{_n_} = 'Y';
lastv{_n_} = nums{_n_};
end;
length keeplist $ 20000;
if done;
do _n_=1 to &n_nums;
if keepyn{_n_} = 'Y' then keeplist = catx(' ', keeplist, vname(nums{_n_}) ) ;
end;
call symputx('keeplist', keeplist);
run;
data want;
set have (keep=id &keeplist);
run;
The code is untested, so you will have to try it. That's probably a good idea before trying to expand it to include character variables as well.
Once it is debugged, the major advantage is that it only takes 2 passes through the data to handle everything that needs to be done.
data test;
input id x1 x2 x3 x4;
datalines;
1 0 1 3 4
1 1 1 3 4
2 2 4 6 8
2 2 6 6 8
3 3 3 3 3
3 3 3 3 3
;
run;
proc sql;
create table temp as
select id,count(distinct x1) as x1, count(distinct x2) as x2,
count(distinct x3) as x3,count(distinct x4) as x4
from test
group by id;
quit;
proc transpose data=temp out=temp1;
by id;
run;
proc sql noprint;
select distinct _name_ into : names separated by ','
from temp1
where col1>1;
create table want as
select id,&names from test;
quit;
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.