BookmarkSubscribeRSS Feed
corkee
Calcite | Level 5

Hi,

 

I'm trying to remove duplicates based on two variables: ID and index_date. I know one could easily use the following code to do this:

 

proc sort data=dsn nodupkey;
	by id index_date;
run;

However, for rows with duplicates of ID and index_date, I want prioritize one row over the others. For example, my dataset looks like this:

 

ID   index_date   priority
1    1/4/18           0
1    1/4/18           1
1    1/4/18           0

I want to keep the row where priority = 1 and remove the ones priority = 0. How can I do this? Would sorting my data on ID, index_date, and descending priority first and then sorting by ID and index_date with nodupkey work?

 

 

Thanks!

19 REPLIES 19
novinosrin
Tourmaline | Level 20

if you wanna subset only the records with priority=1, wouldn't it be a simple

 

where priority=1;

in a datastep/sql?

corkee
Calcite | Level 5

Hi, novinosrin. No, I don't want to ONLY keep observations where priority = 1. There might be non-duplicates where priority = 0, and I want to keep those too. I just want to make sure that for duplicates, the ones where priority = 1 is kept. Thanks!

novinosrin
Tourmaline | Level 20

Can you please post a clear sample of what you "have" and what you "want"?

novinosrin
Tourmaline | Level 20

I modified your sample

 

data have;
input ID   index_date :mmddyy10.   priority;
format  index_date mmddyy10.;
cards;
1    1/4/18           0
1    1/4/18           1
1    1/4/18           0
1    1/6/18           0
1    1/6/18           0
1    1/6/18           0
;


proc sql;
create table want(drop=t) as
select *,count(distinct priority)>1 as t
from have
group by ID ,  index_date
having (t=1 and priority=1) or (t=0) ;
quit;
corkee
Calcite | Level 5

Hi @novinosrin,

 

I just tried your code, and unfortunately it did not give me the desired output. 

Hi @novinosrin,

 

That code didn't work, unfortunately. Let me try to provide a more concrete example:

 

ID index_date priority
1 2/3/18 0
1 2/3/18 1
1 2/3/18 0
2 3/4/15 0
3 7/14/14 0
3 7/14/14 0
4 10/1/17 1


The desired dataset would keep the 2nd, 4th, 5th (or 6th, but not both), and the last row. The code you had provided kept both the 5th and the 6th rows. Thanks for your help!

 

Edit: I think adding another code chunk would do the trick.

proc sort data=want out=want2 nodupkey;
by id index_date;
run;
novinosrin
Tourmaline | Level 20

Hello @corkee  Thank you for clarifying. One last clarification, do you have many other variables or just that?

corkee
Calcite | Level 5

Hi @novinosrin, I have another variable that determines whether priority is = 1 or 0. I have to keep this variable for later uses. Thanks again!

novinosrin
Tourmaline | Level 20

Okay @corkee  I think this one should do

 

data have;
input ID   index_date :mmddyy10.   priority;
format  index_date mmddyy10.;
cards;
1 2/3/18 0
1 2/3/18 1
1 2/3/18 0
2 3/4/15 0
3 7/14/14 0
3 7/14/14 0
4 10/1/17 1
;

proc sort data=have out=temp ;
by id index_date descending priority;
run;
data want;
set temp;
by id index_date;
if first.index_date;
run;
Tom
Super User Tom
Super User

If you want to use the NODUPKEY option on PROC SORT as the way to get to single observation per group then first sort them so the one you want to keep is first.

proc sort data=dsn ;
  by id index_date descending priority ;
run;
proc sort data=dsn nodupkey;
  by id index_date ;
run;

Or you can replace the second PROC SORT with a simple data step.

data dsn;
  set dsn ;
  by id index_date descending priority ;
  if first.index_date;
run;
corkee
Calcite | Level 5

Hi @novinosrin and @Tom, both of your solutions worked and were actually what I had in mind at first. However, I was a bit hesitant because I wasn't sure if if first.index_date would remove just ANY duplicates of index_date. For example,

 

ID   index_date   priority
1    1/2/19           1
2    1/2/19           0

I wouldn't want either rows to be removed, since the ID's are different, even though the index_date's are the same. Could you confirm this is the case? Thanks so much.

Tom
Super User Tom
Super User

@corkee wrote:

Hi @novinosrin and @Tom, both of your solutions worked and were actually what I had in mind at first. However, I was a bit hesitant because I wasn't sure if if first.index_date would remove just ANY duplicates of index_date. For example,

 

ID   index_date   priority
1    1/2/19           1
2    1/2/19           0

I wouldn't want either rows to be removed, since the ID's are different, even though the index_date's are the same. Could you confirm this is the case? Thanks so much.


FIRST.INDEX_DATE means it is the first observation in the set of observations with the same values of ID and INDEX_DATE. 

The FIRST.INDEX_DATE flag is not independent of the early variables in the BY statement. That would not make any sense and probably be impossible for SAS to even know how to calculate.

novinosrin
Tourmaline | Level 20

Hi @corkee , Sir @Tom  is of course absolutely better and experienced in explaining the functionality. From me, I would like increase the buffet menu by offering a Proc SQL

 

proc sql;
create table want(drop=t) as
select distinct *,count(distinct priority)>1 as t
from have
group by ID ,  index_date
having t and priority=1 or t=0 ;
quit;

 

Tom
Super User Tom
Super User

@novinosrin wrote:

Hi @corkee , Sir @Tom  is of course absolutely better and experienced in explaining the functionality. From me, I would like increase the buffet menu by offering a Proc SQL

 

proc sql;
create table want(drop=t) as
select distinct *,count(distinct priority)>1 as t
from have
group by ID ,  index_date
having t and priority=1 or t=0 ;
quit;

 


I am not sure how that is going to eliminate duplicates.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 4682 views
  • 3 likes
  • 3 in conversation