I want to delete duplicates where ID and program_id match, but keep the duplicate with the most recent collection_yr. For example, if two observations have the same ID and program_id but collection_yr values of 2009 and 2010, I want to keep the observation with collection_yr 2010. I have attached an example data set with duplicates. In it, I want to delete the observation with ID=4, program_id=9034 and collection_yr=2009, as that is not the most recent. However, I want to do this on a larger scale, where there are more observations.
Cheers
ID | program_id | COLLECTION_YR |
1 | 999 | 2009 |
2 | 444 | 2010 |
3 | SRC | 2009 |
4 | 9034 | 2009 |
4 | 9034 | 2010 |
4 | 9069 | 2010 |
5 | 7767 | 2006 |
6 | 30999 | 2009 |
Sort with the NODUPKEY option will get you there.
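proc sort data=have out=have_sorted;
  by ID program_id descending COLLECTION_YR; * puts the most recent year first within each ID/program_id group;
run;
proc sort data=have_sorted nodupkey out=unique1;
  by ID program_id; * NODUPKEY keeps the first observation of each BY group;
run;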
Or a SQL solution
proc sql;
  create table unique2 as
  select ID, program_id, max(COLLECTION_YR) as COLLECTION_YR
  from have
  group by ID, program_id;
quit;
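The query above returns only the three columns it lists. If the real table carries other columns that need to survive, one possible variant is to let PROC SQL remerge the group maximum back onto the detail rows and filter on it. This is only a sketch: the output name unique2_all is made up for illustration, and it assumes COLLECTION_YR is numeric.

```sas
proc sql;
  /* Keep every column of the rows whose year equals the maximum */
  /* year within their ID/program_id group. SAS writes a NOTE    */
  /* to the log about remerging summary statistics.              */
  create table unique2_all as
  select *
  from have
  group by ID, program_id
  having COLLECTION_YR = max(COLLECTION_YR);
quit;
```

If two rows in a group share the same most recent year, both are kept, so this does not guarantee one row per ID/program_id the way NODUPKEY does.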
And one more - data step - note this does require the sort from the first piece of code.
data unique3;
  set have_sorted;
  by ID program_id;
  if first.program_id; * keeps only the first (most recent) row of each ID/program_id group;
run;
@BenBrady wrote:
I want to delete duplicates where ID and program_id match, but keep the duplicate with the most recent collection_yr. For example, if two observations have the same ID and program_id but collection_yr values of 2009 and 2010, I want to keep the observation with collection_yr 2010. I have attached an example data set with duplicates. In it, I want to delete the observation with ID=4, program_id=9034 and collection_yr=2009, as that is not the most recent. However, I want to do this on a larger scale, where there are more observations.
Cheers
ID | program_id | COLLECTION_YR |
1 | 999 | 2009 |
2 | 444 | 2010 |
3 | SRC | 2009 |
4 | 9034 | 2009 |
4 | 9034 | 2010 |
4 | 9069 | 2010 |
5 | 7767 | 2006 |
6 | 30999 | 2009 |
One way is to use two proc sorts. e.g.:
proc sort data=have;
  by ID program_id descending COLLECTION_YR;
run;
proc sort data=have out=want nodupkey;
  by ID program_id;
run;
Art, CEO, AnalystFinder.com
Feeling pedantic: if two records have some values the same and at least one value different, they are not "duplicates"; they are similar.
So you are selecting between similar records.
If your data are already sorted by id/program_id/collection_yr:
data want;
set have;
by id program_id collection_yr;
if last.program_id;
run;
If your data are not sorted, and sorting is expensive (i.e. it's a big data set), there is always proc summary:
proc summary data=have nway;
class id program_id;
var collection_yr;
output out=want2 (drop=_type_ _freq_) max=;
run;
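To try any of the approaches in this thread, the sample table from the question can be read in with a short data step. This is a minimal sketch: the data set name have matches the replies, and program_id is read as character because of the value SRC, which is an assumption about how the real data are stored.

```sas
data have;
  /* program_id is character ($) because one value is "SRC" */
  input ID program_id $ COLLECTION_YR;
  datalines;
1 999 2009
2 444 2010
3 SRC 2009
4 9034 2009
4 9034 2010
4 9069 2010
5 7767 2006
6 30999 2009
;
run;
```

Run against this table, each approach should leave seven observations, dropping only the row with ID=4, program_id=9034 and COLLECTION_YR=2009.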