I want to dedupe the duplicate codes based on their Score, the highest score stays followed by sorting high to low of score. But, if the score is the same & reason codes are different, I want to maintain the order of reason codes. The dedupe only happens across one id, ID - 1 would *not* look for duplicates in ID - 2
Example -
HAVE -
data have;
input id name $ score code $;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07
;
run;
ID | NAME | SCORE | CODE |
1 | John | 23 | A05 |
1 | John | 26 | A02 |
1 | John | 26 | A03 |
1 | John | 26 | A01 |
2 | Mark | 21 | A05 |
2 | Mark | 22 | A05 |
2 | Mark | 23 | A07 |
WANT -
ID | NAME | SCORE | CODE |
1 | John | 26 | A02 [ Same Order as HAVE ] |
1 | John | 26 | A03 |
1 | John | 26 | A01 |
1 | John | 23 | A05 |
2 | Mark | 23 | A07 |
2 | Mark | 22 | A05 |
If the original order is important then add a variables to indicate the original order.
You could do it when you create the dataset.
data have;
input id name $ score code $;
row+1;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07
;
Or add it later. Perhaps as a VIEW?
data have_order / view=have_order ;
set have;
row+1;
run;
Now sort by ID NAME CODE and descending SCORE. Then you can de-dupe per ID NAME CODE. Then you can re-sort by the ID NAME descending SCORE ROW to get the codes in the original order.
proc sort data=have out=step1 ;
by id name code descending score ;
run;
proc sort data=step1 out=step2 nodupkey;
by id name code ;
run;
proc sort data=step2 out=want;
by id name descending score row ;
run;
Obs id name score code row 1 1 John 26 A02 2 2 1 John 26 A03 3 3 1 John 26 A01 4 4 1 John 23 A05 1 5 2 Mark 23 A07 7 6 2 Mark 22 A05 6
Not sure I get exactly what you want to do.
But if sort order is crucial for you application, you should have a unique (row?) id column in your data set.
Then do de-duplication, and then to be sure sort it by the earlier assigned id.
Hi @shubham_d
I think this gets you most of the way there. Could you expand on the requirements for the custom sort a bit more? You're only applying a sort on part of the data
*Read data in;
data have;
input id name $ score code $;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07
;
run;
*Create key variable;
data want;
set have;
key = name || code;
run;
*Sort by key;
proc sort data=want ; by key;run;
*Only retain first key (i.e. highest score);
data want;
set want;
by key;
if first.key then output;
drop key;
run;
*Sort output data;
proc sort data=want ; by name descending score;run;
Thanks
Harry
Hello @shubham_d,
Here's another suggestion: First, sort by ID and descending score, then remove duplicates.
proc sort data=have out=temp;
by id descending score;
run;
data want(drop=_:);
array _c[999] $; /* increase dimension if >999 obs. per ID */
do _i=1 by 1 until(last.id);
set temp;
by id;
if code ~in _c then do;
output;
_c[_i]=code;
end;
end;
run;
If performance is an issue, you may want to use a _temporary_ array or a hash object or at least adapt dimension and variable length of array _c.
I think I'm understanding your logic. You're deduplicating by ID - Code, right? And then you want to just sort the data? Agree with @LinusH about adding a temporary row variable to help with restoring the sort order.
data have;
input id name $ score code $;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07
;
run;
*add a row counter, so can restore sort order at the end ;
data want1 ;
set have ;
row++1 ;
run ;
*Deduplicate by id-code. ;
*Select the code with the maximum score ;
proc sort data=want1 ;
by id code descending score ;
run ;
data want ;
set want1 ;
by id code descending score ;
if first.code ;
run ;
*sort to desired order;
proc sort data=want ;
by id descending score row ;
run ;
proc print data=want;
run ;
If the original order is important then add a variables to indicate the original order.
You could do it when you create the dataset.
data have;
input id name $ score code $;
row+1;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07
;
Or add it later. Perhaps as a VIEW?
data have_order / view=have_order ;
set have;
row+1;
run;
Now sort by ID NAME CODE and descending SCORE. Then you can de-dupe per ID NAME CODE. Then you can re-sort by the ID NAME descending SCORE ROW to get the codes in the original order.
proc sort data=have out=step1 ;
by id name code descending score ;
run;
proc sort data=step1 out=step2 nodupkey;
by id name code ;
run;
proc sort data=step2 out=want;
by id name descending score row ;
run;
Obs id name score code row 1 1 John 26 A02 2 2 1 John 26 A03 3 3 1 John 26 A01 4 4 1 John 23 A05 1 5 2 Mark 23 A07 7 6 2 Mark 22 A05 6
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Check out this tutorial series to learn how to build your own steps in SAS Studio.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.