Removing duplicates based on two variables, but prioritizing based on ...

corkee · Posted 09-11-2019 12:29 PM

Hi,

I'm trying to remove duplicates based on two variables: ID and index_date. I know one could easily use the following code to do this:

proc sort data=dsn nodupkey;
	by id index_date;
run;

However, for rows with duplicates of ID and index_date, I want prioritize one row over the others. For example, my dataset looks like this:

ID   index_date   priority
1    1/4/18           0
1    1/4/18           1
1    1/4/18           0

I want to keep the row where priority = 1 and remove the ones priority = 0. How can I do this? Would sorting my data on ID, index_date, and descending priority first and then sorting by ID and index_date with nodupkey work?

Thanks!

novinosrin · Posted 09-11-2019 12:36 PM

if you wanna subset only the records with priority=1, wouldn't it be a simple

where priority=1;

in a datastep/sql?

corkee · Posted 09-11-2019 12:44 PM

Hi, novinosrin. No, I don't want to ONLY keep observations where priority = 1. There might be non-duplicates where priority = 0, and I want to keep those too. I just want to make sure that for duplicates, the ones where priority = 1 is kept. Thanks!

novinosrin · Posted 09-11-2019 12:45 PM

Can you please post a clear sample of what you "have" and what you "want"?

novinosrin · Posted 09-11-2019 12:55 PM

I modified your sample

data have;
input ID   index_date :mmddyy10.   priority;
format  index_date mmddyy10.;
cards;
1    1/4/18           0
1    1/4/18           1
1    1/4/18           0
1    1/6/18           0
1    1/6/18           0
1    1/6/18           0
;


proc sql;
create table want(drop=t) as
select *,count(distinct priority)>1 as t
from have
group by ID ,  index_date
having (t=1 and priority=1) or (t=0) ;
quit;

corkee · Posted 09-11-2019 01:01 PM

Hi @novinosrin,

I just tried your code, and unfortunately it did not give me the desired output.

Hi @novinosrin,

That code didn't work, unfortunately. Let me try to provide a more concrete example:

ID index_date priority
1 2/3/18 0
1 2/3/18 1
1 2/3/18 0
2 3/4/15 0
3 7/14/14 0
3 7/14/14 0
4 10/1/17 1

The desired dataset would keep the 2nd, 4th, 5th (or 6th, but not both), and the last row. The code you had provided kept both the 5th and the 6th rows. Thanks for your help!

Edit: I think adding another code chunk would do the trick.

proc sort data=want out=want2 nodupkey;
by id index_date;
run;

novinosrin · Posted 09-11-2019 01:05 PM

Hello @corkee Thank you for clarifying. One last clarification, do you have many other variables or just that?

corkee · Posted 09-11-2019 01:09 PM

Hi @novinosrin, I have another variable that determines whether priority is = 1 or 0. I have to keep this variable for later uses. Thanks again!

novinosrin · Posted 09-11-2019 01:08 PM

Okay @corkee I think this one should do

data have;
input ID   index_date :mmddyy10.   priority;
format  index_date mmddyy10.;
cards;
1 2/3/18 0
1 2/3/18 1
1 2/3/18 0
2 3/4/15 0
3 7/14/14 0
3 7/14/14 0
4 10/1/17 1
;

proc sort data=have out=temp ;
by id index_date descending priority;
run;
data want;
set temp;
by id index_date;
if first.index_date;
run;

Tom · Posted 09-11-2019 01:10 PM

If you want to use the NODUPKEY option on PROC SORT as the way to get to single observation per group then first sort them so the one you want to keep is first.

proc sort data=dsn ;
  by id index_date descending priority ;
run;
proc sort data=dsn nodupkey;
  by id index_date ;
run;

Or you can replace the second PROC SORT with a simple data step.

data dsn;
  set dsn ;
  by id index_date descending priority ;
  if first.index_date;
run;

corkee · Posted 09-11-2019 01:15 PM

Hi @novinosrin and @Tom, both of your solutions worked and were actually what I had in mind at first. However, I was a bit hesitant because I wasn't sure if if first.index_date would remove just ANY duplicates of index_date. For example,

ID   index_date   priority
1    1/2/19           1
2    1/2/19           0

I wouldn't want either rows to be removed, since the ID's are different, even though the index_date's are the same. Could you confirm this is the case? Thanks so much.

Tom · Posted 09-11-2019 01:19 PM

@corkee wrote:

Hi @novinosrin and @Tom, both of your solutions worked and were actually what I had in mind at first. However, I was a bit hesitant because I wasn't sure if if first.index_date would remove just ANY duplicates of index_date. For example,
ID   index_date   priority
1    1/2/19           1
2    1/2/19           0
I wouldn't want either rows to be removed, since the ID's are different, even though the index_date's are the same. Could you confirm this is the case? Thanks so much.

FIRST.INDEX_DATE means it is the first observation in the set of observations with the same values of ID and INDEX_DATE.

The FIRST.INDEX_DATE flag is not independent of the early variables in the BY statement. That would not make any sense and probably be impossible for SAS to even know how to calculate.

corkee · Posted 09-11-2019 01:22 PM

Thank you @Tom and @novinosrin! You guys are great!

novinosrin · Posted 09-11-2019 01:22 PM

Hi @corkee , Sir @Tom is of course absolutely better and experienced in explaining the functionality. From me, I would like increase the buffet menu by offering a Proc SQL

proc sql;
create table want(drop=t) as
select distinct *,count(distinct priority)>1 as t
from have
group by ID ,  index_date
having t and priority=1 or t=0 ;
quit;

Tom · Posted 09-11-2019 01:33 PM

@novinosrin wrote:

Hi @corkee , Sir @Tom is of course absolutely better and experienced in explaining the functionality. From me, I would like increase the buffet menu by offering a Proc SQL
proc sql;
create table want(drop=t) as
select distinct *,count(distinct priority)>1 as t
from have
group by ID ,  index_date
having t and priority=1 or t=0 ;
quit;

I am not sure how that is going to eliminate duplicates.

Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Re: Removing duplicates based on two variables, but prioritizing based on a third variable

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!