- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I'm trying to remove duplicates based on two variables: ID and index_date. I know one could easily use the following code to do this:
proc sort data=dsn nodupkey;
by id index_date;
run;
However, for rows with duplicates of ID and index_date, I want prioritize one row over the others. For example, my dataset looks like this:
ID index_date priority
1 1/4/18 0
1 1/4/18 1
1 1/4/18 0
I want to keep the row where priority = 1 and remove the ones priority = 0. How can I do this? Would sorting my data on ID, index_date, and descending priority first and then sorting by ID and index_date with nodupkey work?
Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
if you wanna subset only the records with priority=1, wouldn't it be a simple
where priority=1;
in a datastep/sql?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi, novinosrin. No, I don't want to ONLY keep observations where priority = 1. There might be non-duplicates where priority = 0, and I want to keep those too. I just want to make sure that for duplicates, the ones where priority = 1 is kept. Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Can you please post a clear sample of what you "have" and what you "want"?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I modified your sample
data have;
input ID index_date :mmddyy10. priority;
format index_date mmddyy10.;
cards;
1 1/4/18 0
1 1/4/18 1
1 1/4/18 0
1 1/6/18 0
1 1/6/18 0
1 1/6/18 0
;
proc sql;
create table want(drop=t) as
select *,count(distinct priority)>1 as t
from have
group by ID , index_date
having (t=1 and priority=1) or (t=0) ;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @novinosrin,
I just tried your code, and unfortunately it did not give me the desired output.
Hi @novinosrin,
That code didn't work, unfortunately. Let me try to provide a more concrete example:
ID index_date priority
1 2/3/18 0
1 2/3/18 1
1 2/3/18 0
2 3/4/15 0
3 7/14/14 0
3 7/14/14 0
4 10/1/17 1
The desired dataset would keep the 2nd, 4th, 5th (or 6th, but not both), and the last row. The code you had provided kept both the 5th and the 6th rows. Thanks for your help!
Edit: I think adding another code chunk would do the trick.
proc sort data=want out=want2 nodupkey;
by id index_date;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello @corkee Thank you for clarifying. One last clarification, do you have many other variables or just that?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @novinosrin, I have another variable that determines whether priority is = 1 or 0. I have to keep this variable for later uses. Thanks again!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Okay @corkee I think this one should do
data have;
input ID index_date :mmddyy10. priority;
format index_date mmddyy10.;
cards;
1 2/3/18 0
1 2/3/18 1
1 2/3/18 0
2 3/4/15 0
3 7/14/14 0
3 7/14/14 0
4 10/1/17 1
;
proc sort data=have out=temp ;
by id index_date descending priority;
run;
data want;
set temp;
by id index_date;
if first.index_date;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you want to use the NODUPKEY option on PROC SORT as the way to get to single observation per group then first sort them so the one you want to keep is first.
proc sort data=dsn ;
by id index_date descending priority ;
run;
proc sort data=dsn nodupkey;
by id index_date ;
run;
Or you can replace the second PROC SORT with a simple data step.
data dsn;
set dsn ;
by id index_date descending priority ;
if first.index_date;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @novinosrin and @Tom, both of your solutions worked and were actually what I had in mind at first. However, I was a bit hesitant because I wasn't sure if if first.index_date would remove just ANY duplicates of index_date. For example,
ID index_date priority
1 1/2/19 1
2 1/2/19 0
I wouldn't want either rows to be removed, since the ID's are different, even though the index_date's are the same. Could you confirm this is the case? Thanks so much.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@corkee wrote:
Hi @novinosrin and @Tom, both of your solutions worked and were actually what I had in mind at first. However, I was a bit hesitant because I wasn't sure if if first.index_date would remove just ANY duplicates of index_date. For example,
ID index_date priority 1 1/2/19 1 2 1/2/19 0
I wouldn't want either rows to be removed, since the ID's are different, even though the index_date's are the same. Could you confirm this is the case? Thanks so much.
FIRST.INDEX_DATE means it is the first observation in the set of observations with the same values of ID and INDEX_DATE.
The FIRST.INDEX_DATE flag is not independent of the early variables in the BY statement. That would not make any sense and probably be impossible for SAS to even know how to calculate.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you @Tom and @novinosrin! You guys are great!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @corkee , Sir @Tom is of course absolutely better and experienced in explaining the functionality. From me, I would like increase the buffet menu by offering a Proc SQL
proc sql;
create table want(drop=t) as
select distinct *,count(distinct priority)>1 as t
from have
group by ID , index_date
having t and priority=1 or t=0 ;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@novinosrin wrote:
Hi @corkee , Sir @Tom is of course absolutely better and experienced in explaining the functionality. From me, I would like increase the buffet menu by offering a Proc SQL
proc sql; create table want(drop=t) as select distinct *,count(distinct priority)>1 as t from have group by ID , index_date having t and priority=1 or t=0 ; quit;
I am not sure how that is going to eliminate duplicates.