BookmarkSubscribeRSS Feed
phdstud2020
Calcite | Level 5

Hello,

 

I am using SAS version 9.4

 

I have a missing data/duplicates problem. I have merged two datasets in which one date (date2) precedes the other (date1) within a certain window. Note that date1 and date2 are from different datasets. My matching criteria leads some observations to match date2 on the incorrect date1, creating an interesting duplicate data problem.

 

Edit to include my merge procedure:

proc sql;
create table have as
select a.*, b.*
from have1 as a
left join have2 as b
on a.id=b.id and (a.date1>b.date2>a.date3); *date3 & date2 share the same data source. I omit date3 for brevity;
quit;

 

What I have:

id

date1

date2

1

1/1/2008

 

1

1/1/2009

 

1

1/1/2010

1/1/2009

1

1/1/2011

1/1/2009

1

1/1/2011

1/1/2010

I would like to remove the 4th observation only.

edit for clarity: I want to remove the 4th observation in the above example, but as a rule in my full sample. 

 

data have;
id date1 date2;
datalines;
a 1/1/2008 .
a 1/1/2009 .
a 1/1/2010 1/1/2009
a 1/1/2011 1/1/2009
a 1/1/2011 1/1/2010
;
run;

 

What I want:

 

id

date1

date2

1

1/1/2008

 

1

1/1/2009

 

1

1/1/2010

1/1/2009

1

1/1/2011

1/1/2010

 

What happens:

id

date1

date2

1

1/1/2011

1/1/2009

1

1/1/2011

1/1/2010

 

In summary, I need to remove only non-missing duplicates BY id date2, and I need to keep the observation in which date1 is the closest to date2. I have tried sorting by descending order, but it only keeps the date2 which is further away from date1.

 

I scanned the board before posting this, but I didn't see anything quite as unique as my issue. I appreciate any guidance you all can provide. Please let me know if I can improve my post at all.

 

Thank you

 

 

 

3 REPLIES 3
ballardw
Super User

Perhaps you could show how you merged the data so we, may be able to, show how to prevent the duplicates.

 

You need to state a general rule for exactly how to identify which are "duplicates" that need to removed in terms of data values.

I can tell you how to remove a "4th record" very easily. But unless your data is very small the approach would not be very easy to do.

 

data want;
   set have;
   if _n_ = 4 then delete;
run;

is one of several ways to remove one specific record.

phdstud2020
Calcite | Level 5
Thank you. I updated my post to include my merge procedure and clarify my "4th observation" comment. I am currently removing duplicates by id & date2
Tom
Super User Tom
Super User

If there are no other variables you want to keep then just group by ID and DATE2 and take MIN(DATE1) as your DATE values.   If there are other variables you are keeping then you could try using a HAVING() clause, but unless you have unique dates from both datasets you cannot do it directly in SQL code as you could still get multiple observations if there are two DATE1 values that are the same in the same ID , DATE2 group.

 

So just use a data step.  So if you have this data, sorted by ID DATE2 and DATE1 .

 

data have;
  id date1 :mmddyy. date2 :mmddyy.;
format date1 date2 mmddyy10.; datalines; a 1/1/2008 . a 1/1/2009 . a 1/1/2010 1/1/2009 a 1/1/2011 1/1/2009 a 1/1/2011 1/1/2010 ;

Then run this step to reduce to the first observation per ID and DATE1.

data want;
  set have;
  by id date2 date1;
  if first.date2;
run;

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 859 views
  • 1 like
  • 3 in conversation