I'm attempting to use COMPGED to compare names in two different data sets. However, there's a large portion of the data that has the names in reverse order. In order for the COMPGED to be more accurate, I'd like to sort the names in alphabetical order, but i'm having no luck doing so.
Have:
NameSource1 NameSource2
JOHNSMITH SMITHJOHN
Want:
Sort(NameSource1) Sort(NameSource2)
HHIJMNOST HHIJMNOST
Hi @accesnh
data have;
input (NameSource1 NameSource2) (:$20.);
cards;
JOHNSMITH SMITHJOHN
;
data want;
set have;
array t(999) $;
array name(*) NameSource1 NameSource2;
do j=1 to dim(name);
call missing(of t(*));
do i=1 to length(name(j));
t(i)=char(name(j),i);
end;
call sortc(of t(*));
name(j)=cats(of t(*));
end;
drop t: i j;
run;
If memory serves, SORTC also moves blanks to the left. You might want to conclude with this:
name(j) = left(name(j) ) ;
excellent catch. spot on! noted
@Astounding I think I got lucky there because CAT group of functions removes leading and trailing blanks, nonetheless your point is very important for all.
data have;
input (NameSource1 NameSource2) (:$20.);
cards;
JOHNSMITH SMITHJOHN
;
data want;
set have;
array t(999) $1;
array name(*) NameSource1 NameSource2;
do j=1 to dim(name);
call missing(of t(*));
call pokelong(name(j),addrlong(t(1)),vlength(name(j)));
call sortc(of t(*));
name(j)=cats(of t(*));
end;
drop t: j;
run;
What did you try?
Do your data sets have the same numbers of records? How you combine the two is going to have some impact.
Do you have differences in case? Sort order does treat capital and lower case letters as different orders.
How many records are we talking about in each of these sets?
There is the same amount of records in each data set. Also, I used UPPER to make everything the same case as well as compressed the data to remove any special characters. Right now, each data set contains ~20,000 records.
I've never had to do something like this so I've been searching various online resources all day, but have yet to find a solution.
@accesnh wrote:
There is the same amount of records in each data set. Also, I used UPPER to make everything the same case as well as compressed the data to remove any special characters. Right now, each data set contains ~20,000 records.
I've never had to do something like this so I've been searching various online resources all day, but have yet to find a solution.
I will not make any claim to efficiency. The following compares all the names in one set to the other set. So don't bother sorting before hand. The options 'IL' in the compged tells SAS to ignore case and removes blanks before the comparison.
The order provides the data in ascending values of the comparison, meaning less likely matches.
The data step selects the first as the likely closest match.
data set1; input name1 $; datalines; john sally fred mary ; run; data set2; input name2 $; datalines; fred johny sal marianne maria marcy ; run; proc sql; create table temp as select a.name1, b.name2, compged(a.name1,b.name,'IL') as comp from set1 as a, set2 as b order by a.name1, comp ; quit; data want; set temp; by name1; if first.name1; run;
Note that there is a risk of making different names similar or even identical by sorting their letters: See examples in the discussion of a similar question last November (https://communities.sas.com/t5/SAS-Programming/deleting-mixed-duplicates/m-p/510648). Sorting the names (like "Smith John" --> "John Smith") might be safer.
The trick to something like this is to do the lookups conditionally - assuming it's a look up of sorts.
First do an exact match comparison. If the matches are made, remove the records and then proceed with fuzzy matching attempts. At each stage, pull out matched items so you're reducing the # of records that need to be processed each time.
@accesnh wrote:
I'm attempting to use COMPGED to compare names in two different data sets. However, there's a large portion of the data that has the names in reverse order. In order for the COMPGED to be more accurate, I'd like to sort the names in alphabetical order, but i'm having no luck doing so.
Have:
NameSource1 NameSource2
JOHNSMITH SMITHJOHN
Want:
Sort(NameSource1) Sort(NameSource2)
HHIJMNOST HHIJMNOST
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.