DATA Step, Macro, Functions and more

Remove duplicates (based on non-numeric identifier)

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 119
Accepted Solution

Remove duplicates (based on non-numeric identifier)

[ Edited ]

 

I am trying to reduce duplicates in my dataset. Although the identifier is effectively the same, it is in based revered client numbers. In other words, if we would have replaced the client numbers in the above example for row 2 (in the have dataset), then the uniqueness identifier would be the same. My code works well when the client numbers are numerically, but unfortunately, my actual client numbers are strings. Here is an example where my code works well:

 

data work.have_example;
 infile cards expandtabs truncover;
 input manager client1 client2;
 cards;
 6363 123 456
 6363 456 123
;
run;

data work.want_example;
 infile cards expandtabs truncover;
 input manager client1 client2;
 cards;
 6363 123 456
;
run;


*Here is what I have tried to do;

data temp1;
 set have; 
 uniqueness_identifier1=CATS(manager,client1,client2);
 uniqueness_identifier2=CATS(manager,client2,client1);
run;

data temp1;
 set temp1;
 if uniqueness_identifier1>uniqueness_identifier2 then uniqueness_identifier3=uniqueness_identifier1; else uniqueness_identifier3=uniqueness_identifier2;
run;

proc sort data=temp1 out=temp2 nodupkey;
 by uniqueness_identifier3;
quit;


Below is what my actual data looks like and what I would need. I am stuck how to remove the duplicates when the client numbers are non-numeric. Any feedback will be greatly appreciate. 

 

data work.have_actual;
 infile cards expandtabs truncover;
 input manager client1 client2;
 cards;
 6363 123A 45C6C
 6363 45C6C 123A
;
run;

data work.want_actual;
 infile cards expandtabs truncover;
 input manager client1 client2;
 cards;
 6363 123A 45C6C
;
run;

Accepted Solutions
Solution
‎01-09-2018 06:05 PM
Super Contributor
Posts: 320

Re: Remove duplicates (based on non-numeric identifier)

[ Edited ]

CALL SORTC will sort a list of character variables _on that row_ alphabetically.  So in this case this is quite useful.

 

If you don't care about preserving client1/client2, then you can just do it directly to the variables and then proceed with your CAT strategy (I like CATX with a delimiter because it prevents from issues like

 

id1 = ABC

id2 = DEF

 

vs

 

id1 = AB

id2 = CDEF

 

those cats'ed together are identical, but aren't actually.

 

Here's an example preserving client1/client2 order.

 

data work.have_actual;
 infile cards expandtabs truncover;
 input manager client1 $ client2 $;
 cards;
 6363 123A 45C6C
 6363 45C6C 123A
;
run;

data work.temporary;
  set have_actual;
  array client[2];
  array cl_temp[2] $ _temporary_;
  do _i = 1 to dim(cl_temp);
    cl_temp[_i] = client[_i];
  end;
  call sortc(of cl_temp[*]);
  client_unique = catx('|',of cl_temp[*]);
run;

proc sort nodupkey data=work.temporary out=want;
by client_unique;
run;

 

And as Astounding notes indirectly: `>` (greater than) is perfectly fine to use for character variables.  You could just as easily do what you did above, effectively, just as easily with  character variables as numeric - with the same possible pitfalls, of course.  `>` compares the ascii codepage value (or whatever codepage you are using), so `Z` > `A`, lowercase letters are greater than uppercase letters, all letters are greater than all numeric digits, and the various other symbols are all over the place (but consistently for any given symbol).

View solution in original post


All Replies
Super User
Posts: 6,644

Re: Remove duplicates (based on non-numeric identifier)

Two steps to take ...

 

First, use a delimiter as part of the identifier.  Otherwise, you can get data coming in like:

 

123 AB C

123 A BC

 

They will look the same when you put them together.

 

Second, make the comparison (greater than, less than) before constructing the identifier.  For example:

 

if client1 < client2 then unique_identifier = catx('|', manager, client1, client2);

else unique_identifier = catx('|', manager, client2, client1);

 

 

Solution
‎01-09-2018 06:05 PM
Super Contributor
Posts: 320

Re: Remove duplicates (based on non-numeric identifier)

[ Edited ]

CALL SORTC will sort a list of character variables _on that row_ alphabetically.  So in this case this is quite useful.

 

If you don't care about preserving client1/client2, then you can just do it directly to the variables and then proceed with your CAT strategy (I like CATX with a delimiter because it prevents from issues like

 

id1 = ABC

id2 = DEF

 

vs

 

id1 = AB

id2 = CDEF

 

those cats'ed together are identical, but aren't actually.

 

Here's an example preserving client1/client2 order.

 

data work.have_actual;
 infile cards expandtabs truncover;
 input manager client1 $ client2 $;
 cards;
 6363 123A 45C6C
 6363 45C6C 123A
;
run;

data work.temporary;
  set have_actual;
  array client[2];
  array cl_temp[2] $ _temporary_;
  do _i = 1 to dim(cl_temp);
    cl_temp[_i] = client[_i];
  end;
  call sortc(of cl_temp[*]);
  client_unique = catx('|',of cl_temp[*]);
run;

proc sort nodupkey data=work.temporary out=want;
by client_unique;
run;

 

And as Astounding notes indirectly: `>` (greater than) is perfectly fine to use for character variables.  You could just as easily do what you did above, effectively, just as easily with  character variables as numeric - with the same possible pitfalls, of course.  `>` compares the ascii codepage value (or whatever codepage you are using), so `Z` > `A`, lowercase letters are greater than uppercase letters, all letters are greater than all numeric digits, and the various other symbols are all over the place (but consistently for any given symbol).

Frequent Contributor
Posts: 119

Re: Remove duplicates (based on non-numeric identifier)

[ Edited ]
Posted in reply to snoopy369

Thank you for your replies, @snoopy369 and @Astounding. I really liked the array solution that @snoopy369 proposed. I appreciate your suggestions. 

 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 3 replies
  • 137 views
  • 2 likes
  • 3 in conversation