topic Re: Dedupe a dataset without sorting codes in SAS Enterprise Guide

Dedupe a dataset without sorting codes

shubham_d — Fri, 31 Mar 2023 13:23:02 GMT

I want to dedupe the duplicate codes based on their Score, the highest score stays followed by sorting high to low of score. But, if the score is the same & reason codes are different, I want to maintain the order of reason codes. The dedupe only happens across one id, ID - 1 would *not* look for duplicates in ID - 2

Example -

HAVE -

data have;
input id name $ score code $;
datalines;
1 John 23 A05
1 John 26 A02

1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07

;
run;

ID	NAME	SCORE	CODE
1	John	23	A05
1	John	26	A02
1	John	26	A03
1	John	26	A01
2	Mark	21	A05
2	Mark	22	A05
2	Mark	23	A07

WANT -

ID	NAME	SCORE	CODE
1	John	26	A02 [ Same Order as HAVE ]
1	John	26	A03
1	John	26	A01
1	John	23	A05
2	Mark	23	A07
2	Mark	22	A05

Re: Dedupe a dataset without sorting codes

LinusH — Fri, 31 Mar 2023 14:38:59 GMT

Not sure I get exactly what you want to do.

But if sort order is crucial for you application, you should have a unique (row?) id column in your data set.

Then do de-duplication, and then to be sure sort it by the earlier assigned id.

Re: Dedupe a dataset without sorting codes

HarrySnart — Fri, 31 Mar 2023 14:46:16 GMT

Hi @shubham_d

I think this gets you most of the way there. Could you expand on the requirements for the custom sort a bit more? You're only applying a sort on part of the data


*Read data in;

data have;
input id name $ score code $;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07

;
run;

*Create key variable;
data want;
set have;
key = name || code;
run;

*Sort by key;
proc sort data=want ; by key;run;

*Only retain first key (i.e. highest score);
data want;
set want;
by key;
if first.key then output;
drop key;
run;

*Sort output data;
proc sort data=want ; by name descending  score;run;

Thanks

Harry

Re: Dedupe a dataset without sorting codes

FreelanceReinh — Fri, 31 Mar 2023 15:08:35 GMT

Hello @shubham_d,

Here's another suggestion: First, sort by ID and descending score, then remove duplicates.

proc sort data=have out=temp;
by id descending score;
run;

data want(drop=_:);
array _c[999] $; /* increase dimension if >999 obs. per ID */
do _i=1 by 1 until(last.id);
  set temp;
  by id;
  if code ~in _c then do;
    output;
    _c[_i]=code;  
  end;
end;
run;

If performance is an issue, you may want to use a _temporary_ array or a hash object or at least adapt dimension and variable length of array _c.

Re: Dedupe a dataset without sorting codes

Quentin — Fri, 31 Mar 2023 15:32:21 GMT

I think I'm understanding your logic. You're deduplicating by ID - Code, right? And then you want to just sort the data? Agree with @LinusH about adding a temporary row variable to help with restoring the sort order.

data have;
input id name $ score code $;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07
;
run;

*add a row counter, so can restore sort order at the end ;
data want1 ;
  set have ;
  row++1 ;
run ;

*Deduplicate by id-code.  ;
*Select the code with the maximum score ;

proc sort data=want1 ;
  by id code descending score ;
run ;

data want ;
  set want1 ;
  by id code descending score ;
  if first.code ;
run ;

*sort to desired order;

proc sort data=want ;
  by id descending score row ;
run ;

proc print data=want;
run ;

Re: Dedupe a dataset without sorting codes

Tom — Fri, 31 Mar 2023 15:37:36 GMT

If the original order is important then add a variables to indicate the original order.

You could do it when you create the dataset.

data have;
  input id name $ score code $;
  row+1;
datalines;
1 John 23 A05
1 John 26 A02
1 John 26 A03
1 John 26 A01
2 Mark 21 A05
2 Mark 22 A05
2 Mark 23 A07
;

Or add it later. Perhaps as a VIEW?

data have_order / view=have_order ;
  set have;
  row+1;
run;

Now sort by ID NAME CODE and descending SCORE. Then you can de-dupe per ID NAME CODE. Then you can re-sort by the ID NAME descending SCORE ROW to get the codes in the original order.

proc sort data=have out=step1 ;
  by id name code descending score ;
run;
proc sort data=step1 out=step2 nodupkey;
  by id name code ;
run;
proc sort data=step2 out=want;
  by id name descending score row ;
run;

Obs    id    name    score    code    row

 1      1    John      26     A02      2
 2      1    John      26     A03      3
 3      1    John      26     A01      4
 4      1    John      23     A05      1
 5      2    Mark      23     A07      7
 6      2    Mark      22     A05      6

Re: Dedupe a dataset without sorting codes

shubham_d — Fri, 31 Mar 2023 18:02:49 GMT

Hey Harry! Thanks for taking out the time to resolve this.

The original order is important here when it comes to sorting an id that has the same score but a different code

Yes, your code almost got me there, but for better readability instead of using key, I am now using Obs/row number to maintain the original order

Re: Dedupe a dataset without sorting codes

shubham_d — Fri, 31 Mar 2023 18:06:51 GMT

Hey Quentin! Thanks for taking time out to resolve this. Yes, I want to deduplicate the redundant data which have same codes but a less score.

Maintaining a row number helped! Thanks a lot 😄

Re: Dedupe a dataset without sorting codes

shubham_d — Fri, 31 Mar 2023 18:08:09 GMT

Hey Tom! Thanks for taking time out to resolve this. It worked!!

Loved how you & others used the row number to maintain the original order.

Very readable & easy to understand solution. Thanks!

Re: Dedupe a dataset without sorting codes

shubham_d — Fri, 31 Mar 2023 18:11:04 GMT

Hey! Thanks for taking the time to resolve this. Your solution worked.

I am a bit new to SAS & still learning how to work with arrays. Using a row/obs number to maintain the original order worked very well as an alternate solution. Will learn about arrays & try to understand your solution. Thanks 😄

Re: Dedupe a dataset without sorting codes

shubham_d — Fri, 31 Mar 2023 18:12:24 GMT

Hey Tom! Thanks for taking the time to resolve this. Yes, creating a column with an obs/row number worked here.