Solved: Re: Dedupe a dataset without sorting codes

shubham_d · Posted 03-31-2023 09:17 AM

I want to dedupe the duplicate codes based on their Score, the highest score stays followed by sorting high to low of score. But, if the score is the same & reason codes are different, I want to maintain the order of reason codes. The dedupe only happens across one id, ID - 1 would *not* look for duplicates in ID - 2

Example -

HAVE -

data have;
input id name $ score code $;
datalines;
1 John 23 A05
1 John 26 A02

1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07

;
run;

ID	NAME	SCORE	CODE
1	John	23	A05
1	John	26	A02
1	John	26	A03
1	John	26	A01
2	Mark	21	A05
2	Mark	22	A05
2	Mark	23	A07

WANT -

ID	NAME	SCORE	CODE
1	John	26	A02 [ Same Order as HAVE ]
1	John	26	A03
1	John	26	A01
1	John	23	A05
2	Mark	23	A07
2	Mark	22	A05

Tom · Posted 03-31-2023 11:37 AM

If the original order is important then add a variables to indicate the original order.

You could do it when you create the dataset.

data have;
  input id name $ score code $;
  row+1;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07
;

Or add it later. Perhaps as a VIEW?

data have_order / view=have_order ;
  set have;
  row+1;
run;

Now sort by ID NAME CODE and descending SCORE. Then you can de-dupe per ID NAME CODE. Then you can re-sort by the ID NAME descending SCORE ROW to get the codes in the original order.

proc sort data=have out=step1 ;
  by id name code descending score ;
run;
proc sort data=step1 out=step2 nodupkey;
  by id name code ;
run;
proc sort data=step2 out=want;
  by id name descending score row ;
run;

Obs    id    name    score    code    row

 1      1    John      26     A02      2
 2      1    John      26     A03      3
 3      1    John      26     A01      4
 4      1    John      23     A05      1
 5      2    Mark      23     A07      7
 6      2    Mark      22     A05      6

View solution in original post

LinusH · Posted 03-31-2023 10:38 AM

Not sure I get exactly what you want to do.

But if sort order is crucial for you application, you should have a unique (row?) id column in your data set.

Then do de-duplication, and then to be sure sort it by the earlier assigned id.

Data never sleeps

shubham_d · Posted 03-31-2023 02:12 PM

Hey Tom! Thanks for taking the time to resolve this. Yes, creating a column with an obs/row number worked here.

HarrySnart · Posted 03-31-2023 10:46 AM

Hi @shubham_d

I think this gets you most of the way there. Could you expand on the requirements for the custom sort a bit more? You're only applying a sort on part of the data


*Read data in;

data have;
input id name $ score code $;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07

;
run;

*Create key variable;
data want;
set have;
key = name || code;
run;

*Sort by key;
proc sort data=want ; by key;run;

*Only retain first key (i.e. highest score);
data want;
set want;
by key;
if first.key then output;
drop key;
run;

*Sort output data;
proc sort data=want ; by name descending  score;run;

Thanks

Harry

shubham_d · Posted 03-31-2023 02:02 PM

Hey Harry! Thanks for taking out the time to resolve this.

The original order is important here when it comes to sorting an id that has the same score but a different code

Yes, your code almost got me there, but for better readability instead of using key, I am now using Obs/row number to maintain the original order

FreelanceReinh · Posted 03-31-2023 11:08 AM

Hello @shubham_d,

Here's another suggestion: First, sort by ID and descending score, then remove duplicates.

proc sort data=have out=temp;
by id descending score;
run;

data want(drop=_:);
array _c[999] $; /* increase dimension if >999 obs. per ID */
do _i=1 by 1 until(last.id);
  set temp;
  by id;
  if code ~in _c then do;
    output;
    _c[_i]=code;  
  end;
end;
run;

If performance is an issue, you may want to use a _temporary_ array or a hash object or at least adapt dimension and variable length of array _c.

shubham_d · Posted 03-31-2023 02:11 PM

Hey! Thanks for taking the time to resolve this. Your solution worked.

I am a bit new to SAS & still learning how to work with arrays. Using a row/obs number to maintain the original order worked very well as an alternate solution. Will learn about arrays & try to understand your solution. Thanks 😄

Quentin · Posted 03-31-2023 11:32 AM

I think I'm understanding your logic. You're deduplicating by ID - Code, right? And then you want to just sort the data? Agree with @LinusH about adding a temporary row variable to help with restoring the sort order.

data have;
input id name $ score code $;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07
;
run;

*add a row counter, so can restore sort order at the end ;
data want1 ;
  set have ;
  row++1 ;
run ;

*Deduplicate by id-code.  ;
*Select the code with the maximum score ;

proc sort data=want1 ;
  by id code descending score ;
run ;

data want ;
  set want1 ;
  by id code descending score ;
  if first.code ;
run ;

*sort to desired order;

proc sort data=want ;
  by id descending score row ;
run ;

proc print data=want;
run ;

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

shubham_d · Posted 03-31-2023 02:06 PM

Hey Quentin! Thanks for taking time out to resolve this. Yes, I want to deduplicate the redundant data which have same codes but a less score.

Maintaining a row number helped! Thanks a lot 😄

Tom · Posted 03-31-2023 11:37 AM

If the original order is important then add a variables to indicate the original order.

You could do it when you create the dataset.

data have;
  input id name $ score code $;
  row+1;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07
;

Or add it later. Perhaps as a VIEW?

data have_order / view=have_order ;
  set have;
  row+1;
run;

Now sort by ID NAME CODE and descending SCORE. Then you can de-dupe per ID NAME CODE. Then you can re-sort by the ID NAME descending SCORE ROW to get the codes in the original order.

proc sort data=have out=step1 ;
  by id name code descending score ;
run;
proc sort data=step1 out=step2 nodupkey;
  by id name code ;
run;
proc sort data=step2 out=want;
  by id name descending score row ;
run;

Obs    id    name    score    code    row

 1      1    John      26     A02      2
 2      1    John      26     A03      3
 3      1    John      26     A01      4
 4      1    John      23     A05      1
 5      2    Mark      23     A07      7
 6      2    Mark      22     A05      6

shubham_d · Posted 03-31-2023 02:08 PM

Hey Tom! Thanks for taking time out to resolve this. It worked!!

Loved how you & others used the row number to maintain the original order.

Very readable & easy to understand solution. Thanks!

Classroom Training Available!