BookmarkSubscribeRSS Feed
accesnh
Fluorite | Level 6

I'm attempting to use COMPGED to compare names in two different data sets.  However, there's a large portion of the data that has the names in reverse order.  In order for the COMPGED to be more accurate, I'd like to sort the names in alphabetical order, but i'm having no luck doing so.

 

Have:

NameSource1        NameSource2
JOHNSMITH          SMITHJOHN

 

Want:

Sort(NameSource1)        Sort(NameSource2)

HHIJMNOST                      HHIJMNOST

9 REPLIES 9
novinosrin
Tourmaline | Level 20

Hi @accesnh  

 


data have;
input (NameSource1        NameSource2) (:$20.);
cards;
JOHNSMITH          SMITHJOHN
;


data want;
set have;
array t(999) $;
array name(*) NameSource1 NameSource2;
do j=1 to dim(name);
call missing(of t(*));
do i=1 to length(name(j));
t(i)=char(name(j),i);
end;
call sortc(of t(*));
name(j)=cats(of t(*));
end;
drop t: i j;
run;
Astounding
PROC Star

If memory serves, SORTC also moves blanks to the left.  You might want to conclude with this:

 

name(j) = left(name(j) ) ;

novinosrin
Tourmaline | Level 20

excellent catch. spot on! noted 

 

@Astounding   I think I got lucky there because CAT group of functions removes leading and trailing blanks, nonetheless your point is very important for all. 

novinosrin
Tourmaline | Level 20


data have;
input (NameSource1        NameSource2) (:$20.);
cards;
JOHNSMITH          SMITHJOHN
;

data want;
set have;
array t(999) $1;
array name(*) NameSource1 NameSource2;
do j=1 to dim(name);
call missing(of t(*));
call pokelong(name(j),addrlong(t(1)),vlength(name(j)));
call sortc(of t(*));
name(j)=cats(of t(*));
end;
drop t: j;
run;
ballardw
Super User

What did you try?

 

Do your data sets have the same numbers of records? How you combine the two is going to have some impact.

 

Do you have differences in case? Sort order does treat capital and lower case letters as different orders.

 

How many records are we talking about in each of these sets?

accesnh
Fluorite | Level 6

There is the same amount of records in each data set.  Also, I used UPPER to make everything the same case as well as compressed the data to remove any special characters.  Right now, each data set contains ~20,000 records.

 

I've never had to do something like this so I've been searching various online resources all day, but have yet to find a solution.

ballardw
Super User

@accesnh wrote:

There is the same amount of records in each data set.  Also, I used UPPER to make everything the same case as well as compressed the data to remove any special characters.  Right now, each data set contains ~20,000 records.

 

I've never had to do something like this so I've been searching various online resources all day, but have yet to find a solution.


I will not make any claim to efficiency. The following compares all the names in one set to the other set. So don't bother sorting before hand. The options 'IL' in the compged tells SAS to ignore case and removes blanks before the comparison.

The order provides the data in ascending values of the comparison, meaning less likely matches.

The data step selects the first as the likely closest match.

data set1;
   input name1 $;

datalines;
john
sally
fred
mary
;
run;

data set2;
   input name2 $;

datalines;
fred
johny
sal
marianne
maria 
marcy
;
run;

proc sql;
   create table temp as
   select a.name1, b.name2, compged(a.name1,b.name,'IL') as comp
   from set1 as a, set2 as b
   order by a.name1, comp
   ;
quit;

data want;
   set temp;
   by name1;
   if first.name1;
run;
FreelanceReinh
Jade | Level 19

Note that there is a risk of making different names similar or even identical by sorting their letters: See examples in the discussion of a similar question last November (https://communities.sas.com/t5/SAS-Programming/deleting-mixed-duplicates/m-p/510648). Sorting the names (like "Smith John" --> "John Smith") might be safer.

Reeza
Super User

The trick to something like this is to do the lookups conditionally - assuming it's a look up of sorts.

 

First do an exact match comparison. If the matches are made, remove the records and then proceed with fuzzy matching attempts. At each stage, pull out matched items so you're reducing the # of records that need to be processed each time. 

 


@accesnh wrote:

I'm attempting to use COMPGED to compare names in two different data sets.  However, there's a large portion of the data that has the names in reverse order.  In order for the COMPGED to be more accurate, I'd like to sort the names in alphabetical order, but i'm having no luck doing so.

 

Have:

NameSource1        NameSource2
JOHNSMITH          SMITHJOHN

 

Want:

Sort(NameSource1)        Sort(NameSource2)

HHIJMNOST                      HHIJMNOST


 

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 1456 views
  • 4 likes
  • 6 in conversation