BenBrady
Obsidian | Level 7

I want to delete duplicates where ID and program_id match, but keep the observation with the most recent collection_yr. For example, if two observations have the same ID and program_id but collection_yr values of 2009 and 2010, I want to keep the observation with collection_yr=2010. I have attached an example data set with duplicates. In it, I want to delete the observation with ID=4, program_id=9034 and collection_yr=2009, as that is not the most recent. However, I need to do this on a larger scale, where there are many more observations.

Cheers

 

ID program_id COLLECTION_YR
1 999 2009
2 444 2010
3 SRC 2009
4 9034 2009
4 9034 2010
4 9069 2010
5 7767 2006
6 30999 2009
1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Sort with the NODUPKEY option will get you there. 

 

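Proc sort data = have out=have_sorted;
By ID program_id descending COLLECTION_YR; *puts the most recent year first within each ID and program_id;
Run;

Proc sort data = have_sorted nodupkey out=unique1;
By ID program_id; *NODUPKEY keeps the first row of each BY group, so read the sorted data;
Run;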

Or a SQL solution 

 

proc sql;
Create table unique2 as
Select ID, program_id, max(COLLECTION_YR) as COLLECTION_YR
From have
Group by ID, program_id;
Quit;
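
One caveat with the GROUP BY query above: it returns only the three columns named in the SELECT. If the real table carries more variables, a HAVING clause is one possible way to keep whole rows. The sketch below assumes the variable names from the question (ID, program_id, COLLECTION_YR); unique2_all is a hypothetical output name.

```sas
proc sql;
  /* unique2_all is a hypothetical name, not from the thread */
  create table unique2_all as
  select *
  from have
  group by ID, program_id
  /* keep only the rows that carry the latest year in their group */
  having COLLECTION_YR = max(COLLECTION_YR);
quit;
```

PROC SQL remerges the group maximum back onto each row here and writes a note to the log saying so. Rows that tie on the latest year within a group would all be kept, so this is not a strict one-row-per-group guarantee.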

And one more - data step - note this does require the sort from the first piece of code. 

Data unique3;
Set have_sorted;
By ID program_id;

If first.program_id; *after the descending sort, the first row per group is the most recent year;

Run;

 



4 REPLIES


 

art297
Opal | Level 21

One way is to use two PROC SORT steps, the first one sorting in place, e.g.:

proc sort data=have;
  by ID program_id descending COLLECTION_YR;
run;

proc sort data=have out=want nodupkey;
  by ID program_id;
run;

Art, CEO, AnalystFinder.com

 

ballardw
Super User

In a pedantic mood: if two records have some values the same and at least one value different, they are not "duplicates"; they are similar.

 

So you are selecting between similar records.

mkeintz
PROC Star

If your data are already sorted by id/program_id/collection_yr:

 

data want;
  set have;
  by id program_id collection_yr;
  if last.program_id;
run;

 

If your data are not sorted, and sorting is expensive (i.e. it's a big data set), there is always proc summary:

proc summary data=have nway;
  class id program_id;
  var collection_yr;
  output out=want2 (drop=_type_ _freq_) max=;
run;
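
To try either step above against the sample rows from the question, a data step along these lines could build HAVE. This is only a sketch: reading program_id as an 8-character variable is an assumption, made because one of the sample values is SRC.

```sas
data have;
  /* program_id is read as character ($8. is an assumed length) */
  input ID program_id :$8. COLLECTION_YR;
  datalines;
1 999 2009
2 444 2010
3 SRC 2009
4 9034 2009
4 9034 2010
4 9069 2010
5 7767 2006
6 30999 2009
;
run;
```

The sample has seven distinct ID/program_id pairs, so the output should hold seven rows, with the 2010 row kept for ID=4 and program_id=9034.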
