Solved: Remove duplicates (based on non-numeric identifier)

Yegen · Posted 01-09-2018 04:51 PM

I am trying to reduce duplicates in my dataset. Although the identifier is effectively the same, it is in based revered client numbers. In other words, if we would have replaced the client numbers in the above example for row 2 (in the have dataset), then the uniqueness identifier would be the same. My code works well when the client numbers are numerically, but unfortunately, my actual client numbers are strings. Here is an example where my code works well:

data work.have_example;
 infile cards expandtabs truncover;
 input manager client1 client2;
 cards;
 6363 123 456
 6363 456 123
;
run;

data work.want_example;
 infile cards expandtabs truncover;
 input manager client1 client2;
 cards;
 6363 123 456
;
run;


*Here is what I have tried to do;

data temp1;
 set have; 
 uniqueness_identifier1=CATS(manager,client1,client2);
 uniqueness_identifier2=CATS(manager,client2,client1);
run;

data temp1;
 set temp1;
 if uniqueness_identifier1>uniqueness_identifier2 then uniqueness_identifier3=uniqueness_identifier1; else uniqueness_identifier3=uniqueness_identifier2;
run;

proc sort data=temp1 out=temp2 nodupkey;
 by uniqueness_identifier3;
quit;

Below is what my actual data looks like and what I would need. I am stuck how to remove the duplicates when the client numbers are non-numeric. Any feedback will be greatly appreciate.

data work.have_actual;
 infile cards expandtabs truncover;
 input manager client1 client2;
 cards;
 6363 123A 45C6C
 6363 45C6C 123A
;
run;

data work.want_actual;
 infile cards expandtabs truncover;
 input manager client1 client2;
 cards;
 6363 123A 45C6C
;
run;

snoopy369 · Posted 01-09-2018 05:02 PM

CALL SORTC will sort a list of character variables _on that row_ alphabetically. So in this case this is quite useful.

If you don't care about preserving client1/client2, then you can just do it directly to the variables and then proceed with your CAT strategy (I like CATX with a delimiter because it prevents from issues like

id1 = ABC

id2 = DEF

vs

id1 = AB

id2 = CDEF

those cats'ed together are identical, but aren't actually.

Here's an example preserving client1/client2 order.

data work.have_actual;
 infile cards expandtabs truncover;
 input manager client1 $ client2 $;
 cards;
 6363 123A 45C6C
 6363 45C6C 123A
;
run;

data work.temporary;
  set have_actual;
  array client[2];
  array cl_temp[2] $ _temporary_;
  do _i = 1 to dim(cl_temp);
    cl_temp[_i] = client[_i];
  end;
  call sortc(of cl_temp[*]);
  client_unique = catx('|',of cl_temp[*]);
run;

proc sort nodupkey data=work.temporary out=want;
by client_unique;
run;

And as Astounding notes indirectly: `>` (greater than) is perfectly fine to use for character variables. You could just as easily do what you did above, effectively, just as easily with character variables as numeric - with the same possible pitfalls, of course. `>` compares the ascii codepage value (or whatever codepage you are using), so `Z` > `A`, lowercase letters are greater than uppercase letters, all letters are greater than all numeric digits, and the various other symbols are all over the place (but consistently for any given symbol).

View solution in original post

Astounding · Posted 01-09-2018 05:01 PM

Two steps to take ...

First, use a delimiter as part of the identifier. Otherwise, you can get data coming in like:

123 AB C

123 A BC

They will look the same when you put them together.

Second, make the comparison (greater than, less than) before constructing the identifier. For example:

if client1 < client2 then unique_identifier = catx('|', manager, client1, client2);

else unique_identifier = catx('|', manager, client2, client1);

snoopy369 · Posted 01-09-2018 05:02 PM

CALL SORTC will sort a list of character variables _on that row_ alphabetically. So in this case this is quite useful.

If you don't care about preserving client1/client2, then you can just do it directly to the variables and then proceed with your CAT strategy (I like CATX with a delimiter because it prevents from issues like

id1 = ABC

id2 = DEF

vs

id1 = AB

id2 = CDEF

those cats'ed together are identical, but aren't actually.

Here's an example preserving client1/client2 order.

data work.have_actual;
 infile cards expandtabs truncover;
 input manager client1 $ client2 $;
 cards;
 6363 123A 45C6C
 6363 45C6C 123A
;
run;

data work.temporary;
  set have_actual;
  array client[2];
  array cl_temp[2] $ _temporary_;
  do _i = 1 to dim(cl_temp);
    cl_temp[_i] = client[_i];
  end;
  call sortc(of cl_temp[*]);
  client_unique = catx('|',of cl_temp[*]);
run;

proc sort nodupkey data=work.temporary out=want;
by client_unique;
run;

And as Astounding notes indirectly: `>` (greater than) is perfectly fine to use for character variables. You could just as easily do what you did above, effectively, just as easily with character variables as numeric - with the same possible pitfalls, of course. `>` compares the ascii codepage value (or whatever codepage you are using), so `Z` > `A`, lowercase letters are greater than uppercase letters, all letters are greater than all numeric digits, and the various other symbols are all over the place (but consistently for any given symbol).

Yegen · Posted 01-09-2018 06:07 PM

Thank you for your replies, @snoopy369 and @Astounding. I really liked the array solution that @snoopy369 proposed. I appreciate your suggestions.

Remove duplicates (based on non-numeric identifier)

Re: Remove duplicates (based on non-numeric identifier)

Re: Remove duplicates (based on non-numeric identifier)

Re: Remove duplicates (based on non-numeric identifier)

Re: Remove duplicates (based on non-numeric identifier)

Remove duplicates (based on non-numeric identifier)

Re: Remove duplicates (based on non-numeric identifier)

Re: Remove duplicates (based on non-numeric identifier)

Re: Remove duplicates (based on non-numeric identifier)

Re: Remove duplicates (based on non-numeric identifier)

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away