Hello Everyone,
I have the data below (an excel dataset is also attached) and I would like to check for duplicates and then remove the duplicates. For example, in the data set below ID 400 has a filedate of May_2019 and two different certify dates. I would like to identify ID’s that have duplicate certifydates and the file date in the same month and then only keep the most recent certify date. So in the data below the certify date for ID 400 would be 10/10/2019. The dataset I am working with has millions of obs. Thank you in advance.
ID filedate certifydate
100 Jan_2019 2/2/2019
200 Mar_2019 5/10/2019
300 Apr_2019 6/7/2019
400 May_2019 8/10/2019
400 May_2019 10/10/2019
500 Jun_2019 11/12/2019
You could output those Filedates having more than one distinct certify date as a dataset with dupes
proc sql;
create table dupes as
select *
from have
group by id,filedate
having count(distinct certifydate)>1;
quit;
And assuming, if there are more than certifydate within a fiiledate and those certifydates will be distinct, the following datastep would output 2 datasets i.e one with keeping your most recent certifydate and the other with the duplicates
data want dup;
set have;
by id filedate ;
if not (first.filedate and last.filedate) then output dup;
if last.filedate then output want;
run;
Hi @luvscandy27 Is it okay to assume your dataset is sorted by ID Filedate Certifydate
If yes
data want;
do until(last.filedate);
set have;
by id filedate notsorted;
end;
run;
You can basically ignore the NOTSORTED. I just used as it is already grouped by ID Filedate .
And if you do not expect ties in certifydate within a group of filedate, SQL will do
proc sql;
create table want as
select *
from have
group by id,filedate
having certifydate=max(certifydate);
quit;
You could output those Filedates having more than one distinct certify date as a dataset with dupes
proc sql;
create table dupes as
select *
from have
group by id,filedate
having count(distinct certifydate)>1;
quit;
And assuming, if there are more than certifydate within a fiiledate and those certifydates will be distinct, the following datastep would output 2 datasets i.e one with keeping your most recent certifydate and the other with the duplicates
data want dup;
set have;
by id filedate ;
if not (first.filedate and last.filedate) then output dup;
if last.filedate then output want;
run;
Or simpler
data want;
set have;
by id filedate;
if last.filedate;
run;
Why not use hash? 🙂
data _null_;
if _n_ = 1 then do;
if 0 then set work.HAVE;
dcl hash uniq(dataset:"work.HAVE", ordered:'A',duplicate:'r');
uniq.definekey('ID','filedate');
uniq.definedata('ID','filedate','certifydate');
uniq.definedone();
end;
uniq.output(dataset:"work.WANT");
stop;
run;
Depending on the size of your dataset and your environment and how many dups you have, this may run much faster than other methods or not at all (only if your dataset is really, really big and your environment doesn't have much memory). The dataset does not have to be sorted for this to work.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.