Solved: delete corresponding rows based on conditions

BenBrady · Posted 05-21-2017 08:21 PM

I want to delete duplicates where ID and program_id are matching. But I want to keep the duplicate with the most recent collection_yr. For example, if two observations have the same ID and program_id but the completion_yr is 2009 and 2010, I want to keep the observation with the 2010 completion_yr. I have attached an example data set with duplicates. I want to delete the observation with ID=4 and program_id=9034 but collection_yr=2009 as that is not the most recent. However, I want to do this on a larger scale where there are more observations

Cheers

ID	program_id	COLLECTION_YR
1	999	2009
2	444	2010
3	SRC	2009
4	9034	2009
4	9034	2010
4	9069	2010
5	7767	2006
6	30999	2009

Reeza · Posted 05-21-2017 08:56 PM

Sort with the NODUPKEY option will get you there.

Proc sort data = have out=have_sorted;
By Id programID descending year; *puts year as first;
Run;

Proc sort data = have nodupkey out=unique1;
By Id programID;
Run;

Or a SQL solution

proc sql:
Create table unique2 as
Select Id, programID, max(year) as year
From have
Group by Id, programID;
Quit;

And one more - data step - note this does require the sort from the first piece of code.

Data unique3;
Set have_sorted;
By Id programID;

If first.programID;

Run;

@BenBrady wrote:

I want to delete duplicates where ID and program_id are matching. But I want to keep the duplicate with the most recent collection_yr. For example, if two observations have the same ID and program_id but the completion_yr is 2009 and 2010, I want to keep the observation with the 2010 completion_yr. I have attached an example data set with duplicates. I want to delete the observation with ID=4 and program_id=9034 but collection_yr=2009 as that is not the most recent. However, I want to do this on a larger scale where there are more observations

Cheers

ID program_id COLLECTION_YR

1 999 2009

2 444 2010

3 SRC 2009

4 9034 2009

4 9034 2010

4 9069 2010

5 7767 2006

6 30999 2009

View solution in original post

Reeza · Posted 05-21-2017 08:56 PM

Sort with the NODUPKEY option will get you there.

Proc sort data = have out=have_sorted;
By Id programID descending year; *puts year as first;
Run;

Proc sort data = have nodupkey out=unique1;
By Id programID;
Run;

Or a SQL solution

proc sql:
Create table unique2 as
Select Id, programID, max(year) as year
From have
Group by Id, programID;
Quit;

And one more - data step - note this does require the sort from the first piece of code.

Data unique3;
Set have_sorted;
By Id programID;

If first.programID;

Run;

@BenBrady wrote:

I want to delete duplicates where ID and program_id are matching. But I want to keep the duplicate with the most recent collection_yr. For example, if two observations have the same ID and program_id but the completion_yr is 2009 and 2010, I want to keep the observation with the 2010 completion_yr. I have attached an example data set with duplicates. I want to delete the observation with ID=4 and program_id=9034 but collection_yr=2009 as that is not the most recent. However, I want to do this on a larger scale where there are more observations

Cheers

ID program_id COLLECTION_YR

1 999 2009

2 444 2010

3 SRC 2009

4 9034 2009

4 9034 2010

4 9069 2010

5 7767 2006

6 30999 2009

art297 · Posted 05-21-2017 08:56 PM

One way is to use two proc sorts. e.g.:

proc sort data=have;
  by ID program_id descending COLLECTION_YR;
run;

proc sort data=have out=want nodupkey;
  by ID program_id;
run;

Art, CEO, AnalystFinder.com

ballardw · Posted 05-22-2017 10:41 AM

In a pedantic feeling mode: If two records have some values the same and at least one value different they are not "duplicates" they are similar.

So you are selecting between similar records.

mkeintz · Posted 05-22-2017 01:51 PM

If your data are aleady sorted by id /program_id/collection_yr:

data want;
  set have;
  by id program_id collection_yr;
  if last.program_id;
run;

If your data are not sorted, and sorting is expensive (i.e. it's a big data set), there is always proc summary:

proc summary data=have nway;
  class id program_id;
  var collection_yr;
  output out=want2 (drop=_type_ _freq_) max=;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

delete corresponding rows based on conditions

Re: delete corresponding rows based on conditions

Re: delete corresponding rows based on conditions

Re: delete corresponding rows based on conditions

Re: delete corresponding rows based on conditions

Re: delete corresponding rows based on conditions

delete corresponding rows based on conditions

Re: delete corresponding rows based on conditions

Re: delete corresponding rows based on conditions

Re: delete corresponding rows based on conditions

Re: delete corresponding rows based on conditions

Re: delete corresponding rows based on conditions

Click image to register for webinar

Classroom Training Available!