Re: How to ignore missing observations as duplicates AND keep selectiv...

phdstud2020 · Posted 02-16-2022 08:07 PM

Hello,

I am using SAS version 9.4

I have a missing data/duplicates problem. I have merged two datasets in which one date (date2) precedes the other (date1) within a certain window. Note that date1 and date2 are from different datasets. My matching criteria leads some observations to match date2 on the incorrect date1, creating an interesting duplicate data problem.

Edit to include my merge procedure:

proc sql;
create table have as
select a.*, b.*
from have1 as a
left join have2 as b
on a.id=b.id and (a.date1>b.date2>a.date3); *date3 & date2 share the same data source. I omit date3 for brevity;
quit;

What I have:

id	date1	date2
1	1/1/2008
1	1/1/2009
1	1/1/2010	1/1/2009
1	1/1/2011	1/1/2009
1	1/1/2011	1/1/2010

I would like to remove the 4th observation only.

edit for clarity: I want to remove the 4th observation in the above example, but as a rule in my full sample.

data have;
id date1 date2;
datalines;
a 1/1/2008 .
a 1/1/2009 .
a 1/1/2010 1/1/2009
a 1/1/2011 1/1/2009
a 1/1/2011 1/1/2010
;
run;

What I want:

id	date1	date2
1	1/1/2008
1	1/1/2009
1	1/1/2010	1/1/2009
1	1/1/2011	1/1/2010

What happens:

id	date1	date2
1	1/1/2011	1/1/2009
1	1/1/2011	1/1/2010

In summary, I need to remove only non-missing duplicates BY id date2, and I need to keep the observation in which date1 is the closest to date2. I have tried sorting by descending order, but it only keeps the date2 which is further away from date1.

I scanned the board before posting this, but I didn't see anything quite as unique as my issue. I appreciate any guidance you all can provide. Please let me know if I can improve my post at all.

Thank you

ballardw · Posted 02-16-2022 08:27 PM

Perhaps you could show how you merged the data so we, may be able to, show how to prevent the duplicates.

You need to state a general rule for exactly how to identify which are "duplicates" that need to removed in terms of data values.

I can tell you how to remove a "4th record" very easily. But unless your data is very small the approach would not be very easy to do.

data want;
   set have;
   if _n_ = 4 then delete;
run;

is one of several ways to remove one specific record.

phdstud2020 · Posted 02-16-2022 08:55 PM

Thank you. I updated my post to include my merge procedure and clarify my "4th observation" comment. I am currently removing duplicates by id & date2

Tom · Posted 02-16-2022 10:41 PM

If there are no other variables you want to keep then just group by ID and DATE2 and take MIN(DATE1) as your DATE values. If there are other variables you are keeping then you could try using a HAVING() clause, but unless you have unique dates from both datasets you cannot do it directly in SQL code as you could still get multiple observations if there are two DATE1 values that are the same in the same ID , DATE2 group.

So just use a data step. So if you have this data, sorted by ID DATE2 and DATE1 .

data have;
  id date1 :mmddyy. date2 :mmddyy.;
  format date1 date2 mmddyy10.;
datalines;
a 1/1/2008 .
a 1/1/2009 .
a 1/1/2010 1/1/2009
a 1/1/2011 1/1/2009
a 1/1/2011 1/1/2010
;

Then run this step to reduce to the first observation per ID and DATE1.

data want;
  set have;
  by id date2 date1;
  if first.date2;
run;

How to ignore missing observations as duplicates AND keep selective observations?

Re: How to ignore missing observations as duplicates AND keep selective observations?

Re: How to ignore missing observations as duplicates AND keep selective observations?

Re: How to ignore missing observations as duplicates AND keep selective observations?

How to ignore missing observations as duplicates AND keep selective observations?

Re: How to ignore missing observations as duplicates AND keep selective observations?

Re: How to ignore missing observations as duplicates AND keep selective observations?

Re: How to ignore missing observations as duplicates AND keep selective observations?

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!