Solved: Re: Finding and removing duplicates

luvscandy27 · Posted 01-09-2020 03:26 PM

Hello Everyone,

I have the data below (an excel dataset is also attached) and I would like to check for duplicates and then remove the duplicates. For example, in the data set below ID 400 has a filedate of May_2019 and two different certify dates. I would like to identify ID’s that have duplicate certifydates and the file date in the same month and then only keep the most recent certify date. So in the data below the certify date for ID 400 would be 10/10/2019. The dataset I am working with has millions of obs. Thank you in advance.

ID filedate certifydate

100 Jan_2019 2/2/2019

200 Mar_2019 5/10/2019

300 Apr_2019 6/7/2019

400 May_2019 8/10/2019

400 May_2019 10/10/2019

500 Jun_2019 11/12/2019

novinosrin · Posted 01-09-2020 04:53 PM

You could output those Filedates having more than one distinct certify date as a dataset with dupes


proc sql;
create table dupes as
select *
from have
group by id,filedate
having count(distinct certifydate)>1;
quit;

And assuming, if there are more than certifydate within a fiiledate and those certifydates will be distinct, the following datastep would output 2 datasets i.e one with keeping your most recent certifydate and the other with the duplicates

data want dup;
 set have;
 by id filedate ;
 if not (first.filedate and last.filedate) then output dup;
 if last.filedate then output want;
run;

View solution in original post

novinosrin · Posted 01-09-2020 03:32 PM

Hi @luvscandy27 Is it okay to assume your dataset is sorted by ID Filedate Certifydate

If yes

data want;
 do until(last.filedate);
  set have;
  by id filedate notsorted;
 end;
run;

You can basically ignore the NOTSORTED. I just used as it is already grouped by ID Filedate .

luvscandy27 · Posted 01-09-2020 03:37 PM

Yes, it's ok to assume that.

novinosrin · Posted 01-09-2020 03:36 PM

And if you do not expect ties in certifydate within a group of filedate, SQL will do


proc sql;
create table want as
select *
from have
group by id,filedate
having certifydate=max(certifydate);
quit;

luvscandy27 · Posted 01-09-2020 04:48 PM

Thanks everyone! Just one more question, if I wanted to just check the
datset for duplicates how could I do that?

novinosrin · Posted 01-09-2020 04:53 PM

You could output those Filedates having more than one distinct certify date as a dataset with dupes


proc sql;
create table dupes as
select *
from have
group by id,filedate
having count(distinct certifydate)>1;
quit;

And assuming, if there are more than certifydate within a fiiledate and those certifydates will be distinct, the following datastep would output 2 datasets i.e one with keeping your most recent certifydate and the other with the duplicates

data want dup;
 set have;
 by id filedate ;
 if not (first.filedate and last.filedate) then output dup;
 if last.filedate then output want;
run;

novinosrin · Posted 01-09-2020 03:38 PM

Or simpler

data want;
 set have;
 by id filedate;
 if last.filedate;
run;

JeffMaggio · Posted 01-09-2020 04:19 PM

Why not use hash? 🙂

data _null_;

if _n_ = 1 then do;
    if 0 then set work.HAVE;
    dcl hash uniq(dataset:"work.HAVE", ordered:'A',duplicate:'r');
    uniq.definekey('ID','filedate');
    uniq.definedata('ID','filedate','certifydate');
    uniq.definedone();
end;

uniq.output(dataset:"work.WANT");

stop;

run;

Depending on the size of your dataset and your environment and how many dups you have, this may run much faster than other methods or not at all (only if your dataset is really, really big and your environment doesn't have much memory). The dataset does not have to be sorted for this to work.

Registration is open

SAS Training: Just a Click Away