Hi,
I'm trying to remove duplicates based on two variables: ID and index_date. I know one could easily use the following code to do this:
proc sort data=dsn nodupkey;
by id index_date;
run;
However, for rows with duplicates of ID and index_date, I want prioritize one row over the others. For example, my dataset looks like this:
ID index_date priority
1 1/4/18 0
1 1/4/18 1
1 1/4/18 0
I want to keep the row where priority = 1 and remove the ones priority = 0. How can I do this? Would sorting my data on ID, index_date, and descending priority first and then sorting by ID and index_date with nodupkey work?
Thanks!
if you wanna subset only the records with priority=1, wouldn't it be a simple
where priority=1;
in a datastep/sql?
Hi, novinosrin. No, I don't want to ONLY keep observations where priority = 1. There might be non-duplicates where priority = 0, and I want to keep those too. I just want to make sure that for duplicates, the ones where priority = 1 is kept. Thanks!
Can you please post a clear sample of what you "have" and what you "want"?
I modified your sample
data have;
input ID index_date :mmddyy10. priority;
format index_date mmddyy10.;
cards;
1 1/4/18 0
1 1/4/18 1
1 1/4/18 0
1 1/6/18 0
1 1/6/18 0
1 1/6/18 0
;
proc sql;
create table want(drop=t) as
select *,count(distinct priority)>1 as t
from have
group by ID , index_date
having (t=1 and priority=1) or (t=0) ;
quit;
Hi @novinosrin,
I just tried your code, and unfortunately it did not give me the desired output.
Hi @novinosrin,
That code didn't work, unfortunately. Let me try to provide a more concrete example:
ID index_date priority
1 2/3/18 0
1 2/3/18 1
1 2/3/18 0
2 3/4/15 0
3 7/14/14 0
3 7/14/14 0
4 10/1/17 1
The desired dataset would keep the 2nd, 4th, 5th (or 6th, but not both), and the last row. The code you had provided kept both the 5th and the 6th rows. Thanks for your help!
Edit: I think adding another code chunk would do the trick.
proc sort data=want out=want2 nodupkey;
by id index_date;
run;
Hello @corkee Thank you for clarifying. One last clarification, do you have many other variables or just that?
Hi @novinosrin, I have another variable that determines whether priority is = 1 or 0. I have to keep this variable for later uses. Thanks again!
Okay @corkee I think this one should do
data have;
input ID index_date :mmddyy10. priority;
format index_date mmddyy10.;
cards;
1 2/3/18 0
1 2/3/18 1
1 2/3/18 0
2 3/4/15 0
3 7/14/14 0
3 7/14/14 0
4 10/1/17 1
;
proc sort data=have out=temp ;
by id index_date descending priority;
run;
data want;
set temp;
by id index_date;
if first.index_date;
run;
If you want to use the NODUPKEY option on PROC SORT as the way to get to single observation per group then first sort them so the one you want to keep is first.
proc sort data=dsn ;
by id index_date descending priority ;
run;
proc sort data=dsn nodupkey;
by id index_date ;
run;
Or you can replace the second PROC SORT with a simple data step.
data dsn;
set dsn ;
by id index_date descending priority ;
if first.index_date;
run;
Hi @novinosrin and @Tom, both of your solutions worked and were actually what I had in mind at first. However, I was a bit hesitant because I wasn't sure if if first.index_date would remove just ANY duplicates of index_date. For example,
ID index_date priority
1 1/2/19 1
2 1/2/19 0
I wouldn't want either rows to be removed, since the ID's are different, even though the index_date's are the same. Could you confirm this is the case? Thanks so much.
@corkee wrote:
Hi @novinosrin and @Tom, both of your solutions worked and were actually what I had in mind at first. However, I was a bit hesitant because I wasn't sure if if first.index_date would remove just ANY duplicates of index_date. For example,
ID index_date priority 1 1/2/19 1 2 1/2/19 0
I wouldn't want either rows to be removed, since the ID's are different, even though the index_date's are the same. Could you confirm this is the case? Thanks so much.
FIRST.INDEX_DATE means it is the first observation in the set of observations with the same values of ID and INDEX_DATE.
The FIRST.INDEX_DATE flag is not independent of the early variables in the BY statement. That would not make any sense and probably be impossible for SAS to even know how to calculate.
Thank you @Tom and @novinosrin! You guys are great!
Hi @corkee , Sir @Tom is of course absolutely better and experienced in explaining the functionality. From me, I would like increase the buffet menu by offering a Proc SQL
proc sql;
create table want(drop=t) as
select distinct *,count(distinct priority)>1 as t
from have
group by ID , index_date
having t and priority=1 or t=0 ;
quit;
@novinosrin wrote:
Hi @corkee , Sir @Tom is of course absolutely better and experienced in explaining the functionality. From me, I would like increase the buffet menu by offering a Proc SQL
proc sql; create table want(drop=t) as select distinct *,count(distinct priority)>1 as t from have group by ID , index_date having t and priority=1 or t=0 ; quit;
I am not sure how that is going to eliminate duplicates.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.