BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Yegen
Pyrite | Level 9

 

I am trying to reduce duplicates in my dataset. Although the identifier is effectively the same, it is in based revered client numbers. In other words, if we would have replaced the client numbers in the above example for row 2 (in the have dataset), then the uniqueness identifier would be the same. My code works well when the client numbers are numerically, but unfortunately, my actual client numbers are strings. Here is an example where my code works well:

 

data work.have_example;
 infile cards expandtabs truncover;
 input manager client1 client2;
 cards;
 6363 123 456
 6363 456 123
;
run;

data work.want_example;
 infile cards expandtabs truncover;
 input manager client1 client2;
 cards;
 6363 123 456
;
run;


*Here is what I have tried to do;

data temp1;
 set have; 
 uniqueness_identifier1=CATS(manager,client1,client2);
 uniqueness_identifier2=CATS(manager,client2,client1);
run;

data temp1;
 set temp1;
 if uniqueness_identifier1>uniqueness_identifier2 then uniqueness_identifier3=uniqueness_identifier1; else uniqueness_identifier3=uniqueness_identifier2;
run;

proc sort data=temp1 out=temp2 nodupkey;
 by uniqueness_identifier3;
quit;


Below is what my actual data looks like and what I would need. I am stuck how to remove the duplicates when the client numbers are non-numeric. Any feedback will be greatly appreciate. 

 

data work.have_actual;
 infile cards expandtabs truncover;
 input manager client1 client2;
 cards;
 6363 123A 45C6C
 6363 45C6C 123A
;
run;

data work.want_actual;
 infile cards expandtabs truncover;
 input manager client1 client2;
 cards;
 6363 123A 45C6C
;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
snoopy369
Barite | Level 11

CALL SORTC will sort a list of character variables _on that row_ alphabetically.  So in this case this is quite useful.

 

If you don't care about preserving client1/client2, then you can just do it directly to the variables and then proceed with your CAT strategy (I like CATX with a delimiter because it prevents from issues like

 

id1 = ABC

id2 = DEF

 

vs

 

id1 = AB

id2 = CDEF

 

those cats'ed together are identical, but aren't actually.

 

Here's an example preserving client1/client2 order.

 

data work.have_actual;
 infile cards expandtabs truncover;
 input manager client1 $ client2 $;
 cards;
 6363 123A 45C6C
 6363 45C6C 123A
;
run;

data work.temporary;
  set have_actual;
  array client[2];
  array cl_temp[2] $ _temporary_;
  do _i = 1 to dim(cl_temp);
    cl_temp[_i] = client[_i];
  end;
  call sortc(of cl_temp[*]);
  client_unique = catx('|',of cl_temp[*]);
run;

proc sort nodupkey data=work.temporary out=want;
by client_unique;
run;

 

And as Astounding notes indirectly: `>` (greater than) is perfectly fine to use for character variables.  You could just as easily do what you did above, effectively, just as easily with  character variables as numeric - with the same possible pitfalls, of course.  `>` compares the ascii codepage value (or whatever codepage you are using), so `Z` > `A`, lowercase letters are greater than uppercase letters, all letters are greater than all numeric digits, and the various other symbols are all over the place (but consistently for any given symbol).

View solution in original post

3 REPLIES 3
Astounding
PROC Star

Two steps to take ...

 

First, use a delimiter as part of the identifier.  Otherwise, you can get data coming in like:

 

123 AB C

123 A BC

 

They will look the same when you put them together.

 

Second, make the comparison (greater than, less than) before constructing the identifier.  For example:

 

if client1 < client2 then unique_identifier = catx('|', manager, client1, client2);

else unique_identifier = catx('|', manager, client2, client1);

 

 

snoopy369
Barite | Level 11

CALL SORTC will sort a list of character variables _on that row_ alphabetically.  So in this case this is quite useful.

 

If you don't care about preserving client1/client2, then you can just do it directly to the variables and then proceed with your CAT strategy (I like CATX with a delimiter because it prevents from issues like

 

id1 = ABC

id2 = DEF

 

vs

 

id1 = AB

id2 = CDEF

 

those cats'ed together are identical, but aren't actually.

 

Here's an example preserving client1/client2 order.

 

data work.have_actual;
 infile cards expandtabs truncover;
 input manager client1 $ client2 $;
 cards;
 6363 123A 45C6C
 6363 45C6C 123A
;
run;

data work.temporary;
  set have_actual;
  array client[2];
  array cl_temp[2] $ _temporary_;
  do _i = 1 to dim(cl_temp);
    cl_temp[_i] = client[_i];
  end;
  call sortc(of cl_temp[*]);
  client_unique = catx('|',of cl_temp[*]);
run;

proc sort nodupkey data=work.temporary out=want;
by client_unique;
run;

 

And as Astounding notes indirectly: `>` (greater than) is perfectly fine to use for character variables.  You could just as easily do what you did above, effectively, just as easily with  character variables as numeric - with the same possible pitfalls, of course.  `>` compares the ascii codepage value (or whatever codepage you are using), so `Z` > `A`, lowercase letters are greater than uppercase letters, all letters are greater than all numeric digits, and the various other symbols are all over the place (but consistently for any given symbol).

Yegen
Pyrite | Level 9

Thank you for your replies, @snoopy369 and @Astounding. I really liked the array solution that @snoopy369 proposed. I appreciate your suggestions. 

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 1145 views
  • 2 likes
  • 3 in conversation