Re: how to find the duplicated records with two combined IDS in sas

juliajulia · Posted 07-13-2021 12:46 PM

I have a dataset with two IDs, id1 and id2. I want to find out whether there are any duplicate records for the combination of id1 and id2. I wrote the code below, but it doesn't work. Is there an easy way to do this in SAS PROC SQL?

data combine;
set dataset;
combinedid=catx(id1, id2);
run;

PROC SQL;
SELECT id1, id2,
FREQ(combinedid) AS dupe
FROM dataset
GROUP BY combinedid
HAVING dupe GE 2;
quit;

Kurt_Bremser · Posted 07-13-2021 12:51 PM

Run a count and use HAVING:

proc sql;
select id1, id2, count(*) as count
from have
group by id1, id2
having calculated count > 1;
quit;

Or run a PROC FREQ or PROC SUMMARY and keep only the observations with a count > 1.
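
For the PROC SUMMARY route, a minimal sketch might look like the following. It assumes the input dataset is named have, as in the SQL above; the output dataset name counts and the renamed variable count are only illustrative.

```sas
/* Count observations per id1/id2 combination.                     */
/* NWAY keeps only the full id1*id2 crossing; MISSING keeps rows   */
/* where an ID is missing (CLASS drops them by default).           */
proc summary data=have nway;
  class id1 id2 / missing;
  output out=counts (drop=_type_ rename=(_freq_=count));
run;

/* Show only the combinations that occur more than once. */
proc print data=counts;
  where count > 1;
run;
```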

Reeza · Posted 07-13-2021 01:00 PM

Why use SQL at all? Why not use PROC SORT, which has built-in options to help identify duplicates? Look at the NODUPKEY, NOUNIQUEKEY, DUPOUT, and UNIQUEOUT options.
It can identify duplicates across multiple columns and easily separate them into their own data sets with no need to combine anything.

You can definitely roll your own via SQL but it's faster, easier and more efficient to use the developed procedures.
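
As a rough sketch of those PROC SORT options (assuming the input dataset is called have, as elsewhere in this thread; the output dataset names are made up for illustration):

```sas
/* NOUNIQUEKEY: OUT= receives every observation whose id1/id2      */
/* combination occurs more than once; UNIQUEOUT= receives the rest. */
proc sort data=have out=dups nouniquekey uniqueout=singles;
  by id1 id2;
run;

/* NODUPKEY: OUT= keeps the first observation per id1/id2          */
/* combination; DUPOUT= receives the extra copies that were dropped. */
proc sort data=have out=first_only nodupkey dupout=extras;
  by id1 id2;
run;
```

NOUNIQUEKEY and UNIQUEOUT= require a reasonably recent release (SAS 9.3 or later, as far as I know).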

Reeza · Posted 07-13-2021 01:02 PM

You may also want to put the two IDs in a consistent order within each row first.

I.e., should these two rows count as duplicates?

ID1 ID2
ABC DEF
DEF ABC

None of the posted solutions will deal with this scenario.
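
One possible way to handle it, sketched under the assumption that id1 and id2 have the same type and length and that the dataset is called have: copy the smaller value into one helper variable and the larger into another, then count on the helpers. The names have_ordered, lo_id and hi_id are hypothetical.

```sas
/* Put the two IDs in a fixed order within each row, so that       */
/* (ABC, DEF) and (DEF, ABC) produce the same lo_id/hi_id pair.    */
/* Assumes id1 and id2 share the same type and length.             */
data have_ordered;
  set have;
  if id1 <= id2 then do;
    lo_id = id1;
    hi_id = id2;
  end;
  else do;
    lo_id = id2;
    hi_id = id1;
  end;
run;

/* Count on the ordered pair instead of the raw columns. */
proc sql;
  select lo_id, hi_id, count(*) as count
  from have_ordered
  group by lo_id, hi_id
  having calculated count > 1;
quit;
```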

juliajulia · Posted 07-13-2021 01:24 PM

Thank you all. Both PROC SQL and PROC FREQ work in my case:

proc sql;
select id1, id2, count(*) as count
from have
group by id1, id2
having calculated count > 1;
quit;

PROC FREQ data=have;
TABLES id1*id2 / noprint out=duplist;
RUN;
PROC PRINT data=duplist;
WHERE count ge 2;
RUN;
