I am trying to reduce duplicates in my dataset. Although the identifier is effectively the same, it is in based revered client numbers. In other words, if we would have replaced the client numbers in the above example for row 2 (in the have dataset), then the uniqueness identifier would be the same. My code works well when the client numbers are numerically, but unfortunately, my actual client numbers are strings. Here is an example where my code works well:
data work.have_example;
infile cards expandtabs truncover;
input manager client1 client2;
cards;
6363 123 456
6363 456 123
;
run;
data work.want_example;
infile cards expandtabs truncover;
input manager client1 client2;
cards;
6363 123 456
;
run;
*Here is what I have tried to do;
data temp1;
set have;
uniqueness_identifier1=CATS(manager,client1,client2);
uniqueness_identifier2=CATS(manager,client2,client1);
run;
data temp1;
set temp1;
if uniqueness_identifier1>uniqueness_identifier2 then uniqueness_identifier3=uniqueness_identifier1; else uniqueness_identifier3=uniqueness_identifier2;
run;
proc sort data=temp1 out=temp2 nodupkey;
by uniqueness_identifier3;
quit;
Below is what my actual data looks like and what I would need. I am stuck how to remove the duplicates when the client numbers are non-numeric. Any feedback will be greatly appreciate.
data work.have_actual;
infile cards expandtabs truncover;
input manager client1 client2;
cards;
6363 123A 45C6C
6363 45C6C 123A
;
run;
data work.want_actual;
infile cards expandtabs truncover;
input manager client1 client2;
cards;
6363 123A 45C6C
;
run;
CALL SORTC will sort a list of character variables _on that row_ alphabetically. So in this case this is quite useful.
If you don't care about preserving client1/client2, then you can just do it directly to the variables and then proceed with your CAT strategy (I like CATX with a delimiter because it prevents from issues like
id1 = ABC
id2 = DEF
vs
id1 = AB
id2 = CDEF
those cats'ed together are identical, but aren't actually.
Here's an example preserving client1/client2 order.
data work.have_actual;
infile cards expandtabs truncover;
input manager client1 $ client2 $;
cards;
6363 123A 45C6C
6363 45C6C 123A
;
run;
data work.temporary;
set have_actual;
array client[2];
array cl_temp[2] $ _temporary_;
do _i = 1 to dim(cl_temp);
cl_temp[_i] = client[_i];
end;
call sortc(of cl_temp[*]);
client_unique = catx('|',of cl_temp[*]);
run;
proc sort nodupkey data=work.temporary out=want;
by client_unique;
run;
And as Astounding notes indirectly: `>` (greater than) is perfectly fine to use for character variables. You could just as easily do what you did above, effectively, just as easily with character variables as numeric - with the same possible pitfalls, of course. `>` compares the ascii codepage value (or whatever codepage you are using), so `Z` > `A`, lowercase letters are greater than uppercase letters, all letters are greater than all numeric digits, and the various other symbols are all over the place (but consistently for any given symbol).
Two steps to take ...
First, use a delimiter as part of the identifier. Otherwise, you can get data coming in like:
123 AB C
123 A BC
They will look the same when you put them together.
Second, make the comparison (greater than, less than) before constructing the identifier. For example:
if client1 < client2 then unique_identifier = catx('|', manager, client1, client2);
else unique_identifier = catx('|', manager, client2, client1);
CALL SORTC will sort a list of character variables _on that row_ alphabetically. So in this case this is quite useful.
If you don't care about preserving client1/client2, then you can just do it directly to the variables and then proceed with your CAT strategy (I like CATX with a delimiter because it prevents from issues like
id1 = ABC
id2 = DEF
vs
id1 = AB
id2 = CDEF
those cats'ed together are identical, but aren't actually.
Here's an example preserving client1/client2 order.
data work.have_actual;
infile cards expandtabs truncover;
input manager client1 $ client2 $;
cards;
6363 123A 45C6C
6363 45C6C 123A
;
run;
data work.temporary;
set have_actual;
array client[2];
array cl_temp[2] $ _temporary_;
do _i = 1 to dim(cl_temp);
cl_temp[_i] = client[_i];
end;
call sortc(of cl_temp[*]);
client_unique = catx('|',of cl_temp[*]);
run;
proc sort nodupkey data=work.temporary out=want;
by client_unique;
run;
And as Astounding notes indirectly: `>` (greater than) is perfectly fine to use for character variables. You could just as easily do what you did above, effectively, just as easily with character variables as numeric - with the same possible pitfalls, of course. `>` compares the ascii codepage value (or whatever codepage you are using), so `Z` > `A`, lowercase letters are greater than uppercase letters, all letters are greater than all numeric digits, and the various other symbols are all over the place (but consistently for any given symbol).
Thank you for your replies, @snoopy369 and @Astounding. I really liked the array solution that @snoopy369 proposed. I appreciate your suggestions.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.