BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
shubham_d
Fluorite | Level 6

I want to dedupe the duplicate codes based on their Score, the highest score stays followed by sorting high to low of score. But, if the score is the same & reason codes are different, I want to maintain the order of reason codes. The dedupe only happens across one id, ID - 1 would *not* look for duplicates in ID - 2

 

Example - 

 

HAVE - 

data have;
input id name $ score code $;
datalines;
1 John 23 A05
1 John 26 A02

1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07

;
run;

 

IDNAMESCORECODE
1John23A05
1John26A02
1John26A03
1John26A01
2Mark21A05
2Mark22A05
2Mark23A07

 

WANT - 

 

IDNAMESCORECODE
1John26A02 [ Same Order as HAVE ]
John26A03
1John26A01
1John23A05
2Mark23A07
2Mark22A05

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

If the original order is important then add a variables to indicate the original order.

You could do it when you create the dataset.

data have;
  input id name $ score code $;
  row+1;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07
;

Or add it later.  Perhaps as a VIEW?

data have_order / view=have_order ;
  set have;
  row+1;
run;

Now sort by ID NAME CODE and descending SCORE.  Then you can de-dupe per ID NAME CODE.  Then you can re-sort by the ID NAME descending SCORE ROW to get the codes in the original order.

proc sort data=have out=step1 ;
  by id name code descending score ;
run;
proc sort data=step1 out=step2 nodupkey;
  by id name code ;
run;
proc sort data=step2 out=want;
  by id name descending score row ;
run;
Obs    id    name    score    code    row

 1      1    John      26     A02      2
 2      1    John      26     A03      3
 3      1    John      26     A01      4
 4      1    John      23     A05      1
 5      2    Mark      23     A07      7
 6      2    Mark      22     A05      6

 

View solution in original post

10 REPLIES 10
LinusH
Tourmaline | Level 20

Not sure I get exactly what you want to do.

But if sort order is crucial for you application, you should have a unique (row?) id column in your data set.

Then do de-duplication, and then to be sure sort it by the earlier assigned id.

Data never sleeps
shubham_d
Fluorite | Level 6
Hey Tom! Thanks for taking the time to resolve this. Yes, creating a column with an obs/row number worked here.
HarrySnart
SAS Employee

Hi @shubham_d 

 

I think this gets you most of the way there. Could you expand on the requirements for the custom sort a bit more? You're only applying a sort on part of the data

 


*Read data in;

data have;
input id name $ score code $;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07

;
run;

*Create key variable;
data want;
set have;
key = name || code;
run;

*Sort by key;
proc sort data=want ; by key;run;

*Only retain first key (i.e. highest score);
data want;
set want;
by key;
if first.key then output;
drop key;
run;

*Sort output data;
proc sort data=want ; by name descending  score;run;

Thanks

Harry

shubham_d
Fluorite | Level 6
Hey Harry! Thanks for taking out the time to resolve this.

The original order is important here when it comes to sorting an id that has the same score but a different code

Yes, your code almost got me there, but for better readability instead of using key, I am now using Obs/row number to maintain the original order
FreelanceReinh
Jade | Level 19

Hello @shubham_d,

 

Here's another suggestion: First, sort by ID and descending score, then remove duplicates.

proc sort data=have out=temp;
by id descending score;
run;

data want(drop=_:);
array _c[999] $; /* increase dimension if >999 obs. per ID */
do _i=1 by 1 until(last.id);
  set temp;
  by id;
  if code ~in _c then do;
    output;
    _c[_i]=code;  
  end;
end;
run;

If performance is an issue, you may want to use a _temporary_ array or a hash object or at least adapt dimension and variable length of array _c.

shubham_d
Fluorite | Level 6
Hey! Thanks for taking the time to resolve this. Your solution worked.

I am a bit new to SAS & still learning how to work with arrays. Using a row/obs number to maintain the original order worked very well as an alternate solution. Will learn about arrays & try to understand your solution. Thanks 😄
Quentin
Super User

I think I'm understanding your logic.  You're deduplicating by ID - Code, right?  And then you want to just sort the data?  Agree with @LinusH about adding a temporary row variable to help with restoring the sort order.  

 

data have;
input id name $ score code $;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07
;
run;

*add a row counter, so can restore sort order at the end ;
data want1 ;
  set have ;
  row++1 ;
run ;

*Deduplicate by id-code.  ;
*Select the code with the maximum score ;

proc sort data=want1 ;
  by id code descending score ;
run ;

data want ;
  set want1 ;
  by id code descending score ;
  if first.code ;
run ;

*sort to desired order;
proc sort data=want ; by id descending score row ; run ; proc print data=want; run ;
BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
shubham_d
Fluorite | Level 6
Hey Quentin! Thanks for taking time out to resolve this. Yes, I want to deduplicate the redundant data which have same codes but a less score.

Maintaining a row number helped! Thanks a lot 😄
Tom
Super User Tom
Super User

If the original order is important then add a variables to indicate the original order.

You could do it when you create the dataset.

data have;
  input id name $ score code $;
  row+1;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07
;

Or add it later.  Perhaps as a VIEW?

data have_order / view=have_order ;
  set have;
  row+1;
run;

Now sort by ID NAME CODE and descending SCORE.  Then you can de-dupe per ID NAME CODE.  Then you can re-sort by the ID NAME descending SCORE ROW to get the codes in the original order.

proc sort data=have out=step1 ;
  by id name code descending score ;
run;
proc sort data=step1 out=step2 nodupkey;
  by id name code ;
run;
proc sort data=step2 out=want;
  by id name descending score row ;
run;
Obs    id    name    score    code    row

 1      1    John      26     A02      2
 2      1    John      26     A03      3
 3      1    John      26     A01      4
 4      1    John      23     A05      1
 5      2    Mark      23     A07      7
 6      2    Mark      22     A05      6

 

shubham_d
Fluorite | Level 6
Hey Tom! Thanks for taking time out to resolve this. It worked!!

Loved how you & others used the row number to maintain the original order.

Very readable & easy to understand solution. Thanks!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1073 views
  • 1 like
  • 6 in conversation