BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
luvscandy27
Quartz | Level 8

Hello Everyone,

I have the data below (an excel dataset is also attached) and I would like to check for duplicates and then remove the duplicates. For example, in the data set below ID 400 has a filedate of May_2019 and two different certify dates. I would like to identify ID’s that have duplicate certifydates and the file date in the same month and then only keep the most recent certify date. So in the data below the certify date for ID 400 would be 10/10/2019. The dataset I am working with has millions of obs. Thank you in advance.

 

ID            filedate                certifydate

100         Jan_2019             2/2/2019

200         Mar_2019            5/10/2019

300         Apr_2019             6/7/2019

400         May_2019           8/10/2019

400         May_2019           10/10/2019

500         Jun_2019             11/12/2019

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

You could output those Filedates having more than one distinct certify date as a dataset with dupes

 


proc sql;
create table dupes as
select *
from have
group by id,filedate
having count(distinct certifydate)>1;
quit;

And assuming, if there are more than certifydate within a fiiledate and those certifydates will be distinct, the following datastep would output 2 datasets i.e one with keeping your most recent certifydate and the other with the duplicates

 

data want dup;
 set have;
 by id filedate ;
 if not (first.filedate and last.filedate) then output dup;
 if last.filedate then output want;
run;

View solution in original post

7 REPLIES 7
novinosrin
Tourmaline | Level 20

Hi @luvscandy27  Is it okay to assume your dataset is sorted by ID Filedate Certifydate

 

If yes

 

data want;
 do until(last.filedate);
  set have;
  by id filedate notsorted;
 end;
run;

You can basically ignore the NOTSORTED. I just used as it is already grouped by ID Filedate . 

luvscandy27
Quartz | Level 8
Yes, it's ok to assume that.
novinosrin
Tourmaline | Level 20

And if you do not expect ties in certifydate within a group of filedate, SQL will do

 


proc sql;
create table want as
select *
from have
group by id,filedate
having certifydate=max(certifydate);
quit;
luvscandy27
Quartz | Level 8
Thanks everyone! Just one more question, if I wanted to just check the
datset for duplicates how could I do that?
novinosrin
Tourmaline | Level 20

You could output those Filedates having more than one distinct certify date as a dataset with dupes

 


proc sql;
create table dupes as
select *
from have
group by id,filedate
having count(distinct certifydate)>1;
quit;

And assuming, if there are more than certifydate within a fiiledate and those certifydates will be distinct, the following datastep would output 2 datasets i.e one with keeping your most recent certifydate and the other with the duplicates

 

data want dup;
 set have;
 by id filedate ;
 if not (first.filedate and last.filedate) then output dup;
 if last.filedate then output want;
run;
novinosrin
Tourmaline | Level 20

Or simpler 

 

data want;
 set have;
 by id filedate;
 if last.filedate;
run;
JeffMaggio
Obsidian | Level 7

Why not use hash? 🙂

 

data _null_;

if _n_ = 1 then do;
    if 0 then set work.HAVE;
    dcl hash uniq(dataset:"work.HAVE", ordered:'A',duplicate:'r');
    uniq.definekey('ID','filedate');
    uniq.definedata('ID','filedate','certifydate');
    uniq.definedone();
end;

uniq.output(dataset:"work.WANT");

stop;

run;

Depending on the size of your dataset and your environment and how many dups you have, this may run much faster than other methods or not at all (only if your dataset is really, really big and your environment doesn't have much memory). The dataset does not have to be sorted for this to work.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1171 views
  • 6 likes
  • 3 in conversation