I want to delete duplicates where ID and program_id are matching. But I want to keep the duplicate with the most recent collection_yr. For example, if two observations have the same ID and program_id but the completion_yr is 2009 and 2010, I want to keep the observation with the 2010 completion_yr. I have attached an example data set with duplicates. I want to delete the observation with ID=4 and program_id=9034 but collection_yr=2009 as that is not the most recent. However, I want to do this on a larger scale where there are more observations
Cheers
ID | program_id | COLLECTION_YR |
1 | 999 | 2009 |
2 | 444 | 2010 |
3 | SRC | 2009 |
4 | 9034 | 2009 |
4 | 9034 | 2010 |
4 | 9069 | 2010 |
5 | 7767 | 2006 |
6 | 30999 | 2009 |
Sort with the NODUPKEY option will get you there.
Proc sort data = have out=have_sorted;
By Id programID descending year; *puts year as first;
Run;
Proc sort data = have nodupkey out=unique1;
By Id programID;
Run;
Or a SQL solution
proc sql:
Create table unique2 as
Select Id, programID, max(year) as year
From have
Group by Id, programID;
Quit;
And one more - data step - note this does require the sort from the first piece of code.
Data unique3;
Set have_sorted;
By Id programID;
If first.programID;
Run;
@BenBrady wrote:
I want to delete duplicates where ID and program_id are matching. But I want to keep the duplicate with the most recent collection_yr. For example, if two observations have the same ID and program_id but the completion_yr is 2009 and 2010, I want to keep the observation with the 2010 completion_yr. I have attached an example data set with duplicates. I want to delete the observation with ID=4 and program_id=9034 but collection_yr=2009 as that is not the most recent. However, I want to do this on a larger scale where there are more observations
Cheers
ID program_id COLLECTION_YR 1 999 2009 2 444 2010 3 SRC 2009 4 9034 2009 4 9034 2010 4 9069 2010 5 7767 2006 6 30999 2009
Sort with the NODUPKEY option will get you there.
Proc sort data = have out=have_sorted;
By Id programID descending year; *puts year as first;
Run;
Proc sort data = have nodupkey out=unique1;
By Id programID;
Run;
Or a SQL solution
proc sql:
Create table unique2 as
Select Id, programID, max(year) as year
From have
Group by Id, programID;
Quit;
And one more - data step - note this does require the sort from the first piece of code.
Data unique3;
Set have_sorted;
By Id programID;
If first.programID;
Run;
@BenBrady wrote:
I want to delete duplicates where ID and program_id are matching. But I want to keep the duplicate with the most recent collection_yr. For example, if two observations have the same ID and program_id but the completion_yr is 2009 and 2010, I want to keep the observation with the 2010 completion_yr. I have attached an example data set with duplicates. I want to delete the observation with ID=4 and program_id=9034 but collection_yr=2009 as that is not the most recent. However, I want to do this on a larger scale where there are more observations
Cheers
ID program_id COLLECTION_YR 1 999 2009 2 444 2010 3 SRC 2009 4 9034 2009 4 9034 2010 4 9069 2010 5 7767 2006 6 30999 2009
One way is to use two proc sorts. e.g.:
proc sort data=have; by ID program_id descending COLLECTION_YR; run; proc sort data=have out=want nodupkey; by ID program_id; run;
Art, CEO, AnalystFinder.com
In a pedantic feeling mode: If two records have some values the same and at least one value different they are not "duplicates" they are similar.
So you are selecting between similar records.
If your data are aleady sorted by id /program_id/collection_yr:
data want;
set have;
by id program_id collection_yr;
if last.program_id;
run;
If your data are not sorted, and sorting is expensive (i.e. it's a big data set), there is always proc summary:
proc summary data=have nway;
class id program_id;
var collection_yr;
output out=want2 (drop=_type_ _freq_) max=;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.