<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Finding and removing duplicates in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616315#M180404</link>
    <description>&lt;P&gt;Hello Everyone,&lt;/P&gt;&lt;P&gt;I have the data below (an excel dataset is also attached) and I would like to check for duplicates and then remove the duplicates. For example, in the data set below ID 400 has a filedate of May_2019 and two different certify dates. I would like to identify ID's that have duplicate certifydates and the file date in the same month and then only keep the most recent certify date. So in the data below the certify date for ID 400 would be 10/10/2019. The dataset I am working with has millions of obs. Thank you in advance.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;ID &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; filedate &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; certifydate&lt;/P&gt;&lt;P&gt;100&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Jan_2019&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2/2/2019&lt;/P&gt;&lt;P&gt;200&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Mar_2019&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5/10/2019&lt;/P&gt;&lt;P&gt;300&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Apr_2019&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 6/7/2019&lt;/P&gt;&lt;P&gt;400&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; May_2019&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 8/10/2019&lt;/P&gt;&lt;P&gt;400&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 
May_2019&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 10/10/2019&lt;/P&gt;&lt;P&gt;500&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Jun_2019&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 11/12/2019&lt;/P&gt;</description>
    <pubDate>Thu, 09 Jan 2020 20:26:53 GMT</pubDate>
    <dc:creator>luvscandy27</dc:creator>
    <dc:date>2020-01-09T20:26:53Z</dc:date>
    <item>
      <title>Finding and removing duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616315#M180404</link>
      <description>&lt;P&gt;Hello Everyone,&lt;/P&gt;&lt;P&gt;I have the data below (an excel dataset is also attached) and I would like to check for duplicates and then remove the duplicates. For example, in the data set below ID 400 has a filedate of May_2019 and two different certify dates. I would like to identify ID's that have duplicate certifydates and the file date in the same month and then only keep the most recent certify date. So in the data below the certify date for ID 400 would be 10/10/2019. The dataset I am working with has millions of obs. Thank you in advance.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;ID &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; filedate &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; certifydate&lt;/P&gt;&lt;P&gt;100&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Jan_2019&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2/2/2019&lt;/P&gt;&lt;P&gt;200&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Mar_2019&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5/10/2019&lt;/P&gt;&lt;P&gt;300&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Apr_2019&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 6/7/2019&lt;/P&gt;&lt;P&gt;400&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; May_2019&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 8/10/2019&lt;/P&gt;&lt;P&gt;400&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 
May_2019&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 10/10/2019&lt;/P&gt;&lt;P&gt;500&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Jun_2019&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 11/12/2019&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jan 2020 20:26:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616315#M180404</guid>
      <dc:creator>luvscandy27</dc:creator>
      <dc:date>2020-01-09T20:26:53Z</dc:date>
    </item>
    <item>
      <title>Re: Finding and removing duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616317#M180405</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/255656"&gt;@luvscandy27&lt;/a&gt;&amp;nbsp; Is it okay to assume your dataset is sorted&lt;STRONG&gt; by ID Filedate Certifydate&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If yes&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
 do until(last.filedate);
  set have;
  by id filedate notsorted;
 end;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;You can basically ignore the NOTSORTED. I just used it as the data is already grouped by ID Filedate.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jan 2020 20:35:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616317#M180405</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2020-01-09T20:35:19Z</dc:date>
    </item>
    <item>
      <title>Re: Finding and removing duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616319#M180406</link>
      <description>&lt;P&gt;And if you do not expect ties in certifydate within a group of filedate, SQL will do&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;
proc sql;
create table want as
select *
from have
group by id,filedate
having certifydate=max(certifydate);
quit;
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 09 Jan 2020 20:36:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616319#M180406</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2020-01-09T20:36:43Z</dc:date>
    </item>
    <item>
      <title>Re: Finding and removing duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616320#M180407</link>
      <description>Yes, it's ok to assume that.&lt;BR /&gt;</description>
      <pubDate>Thu, 09 Jan 2020 20:37:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616320#M180407</guid>
      <dc:creator>luvscandy27</dc:creator>
      <dc:date>2020-01-09T20:37:05Z</dc:date>
    </item>
    <item>
      <title>Re: Finding and removing duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616321#M180408</link>
      <description>&lt;P&gt;Or simpler&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
 set have;
 by id filedate;
 if last.filedate;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 09 Jan 2020 20:38:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616321#M180408</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2020-01-09T20:38:27Z</dc:date>
    </item>
    <item>
      <title>Re: Finding and removing duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616328#M180412</link>
      <description>&lt;P&gt;Why not use hash? &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;

if _n_ = 1 then do;
    if 0 then set work.HAVE;
    dcl hash uniq(dataset:"work.HAVE", ordered:'A',duplicate:'r');
    uniq.definekey('ID','filedate');
    uniq.definedata('ID','filedate','certifydate');
    uniq.definedone();
end;

uniq.output(dataset:"work.WANT");

stop;

run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Depending on the size of your dataset and your environment and how many dups you have, this may run much faster than other methods or not at all (only if your dataset is really, really big and your environment doesn't have much memory). The dataset does not have to be sorted for this to work.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jan 2020 21:30:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616328#M180412</guid>
      <dc:creator>JeffMaggio</dc:creator>
      <dc:date>2020-01-09T21:30:27Z</dc:date>
    </item>
    <item>
      <title>Re: Finding and removing duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616331#M180413</link>
      <description>Thanks everyone! Just one more question, if I wanted to just check the&lt;BR /&gt;dataset for duplicates how could I do that?&lt;BR /&gt;</description>
      <pubDate>Thu, 09 Jan 2020 21:48:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616331#M180413</guid>
      <dc:creator>luvscandy27</dc:creator>
      <dc:date>2020-01-09T21:48:05Z</dc:date>
    </item>
    <item>
      <title>Re: Finding and removing duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616333#M180414</link>
      <description>&lt;P&gt;You could output those Filedates having more than one distinct certify date as a dataset with dupes&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;
proc sql;
create table dupes as
select *
from have
group by id,filedate
having count(distinct certifydate)&amp;gt;1;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;And assuming, if there are more than one certifydate within a filedate and those certifydates will be distinct, the following datastep would output 2 datasets i.e. one with keeping your most recent certifydate and the other with the duplicates&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want dup;
 set have;
 by id filedate ;
 if not (first.filedate and last.filedate) then output dup;
 if last.filedate then output want;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 09 Jan 2020 22:09:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-and-removing-duplicates/m-p/616333#M180414</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2020-01-09T22:09:23Z</dc:date>
    </item>
  </channel>
</rss>

