Custom Sorting

accesnh · Posted 04-16-2019 02:59 PM

I'm attempting to use COMPGED to compare names in two different data sets. However, there's a large portion of the data that has the names in reverse order. In order for the COMPGED to be more accurate, I'd like to sort the names in alphabetical order, but i'm having no luck doing so.

Have:

NameSource1 NameSource2
JOHNSMITH SMITHJOHN

Want:

Sort(NameSource1) Sort(NameSource2)

HHIJMNOST HHIJMNOST

novinosrin · Posted 04-16-2019 03:12 PM

Hi @accesnh


data have;
input (NameSource1        NameSource2) (:$20.);
cards;
JOHNSMITH          SMITHJOHN
;


data want;
set have;
array t(999) $;
array name(*) NameSource1 NameSource2;
do j=1 to dim(name);
call missing(of t(*));
do i=1 to length(name(j));
t(i)=char(name(j),i);
end;
call sortc(of t(*));
name(j)=cats(of t(*));
end;
drop t: i j;
run;

Astounding · Posted 04-16-2019 03:37 PM

If memory serves, SORTC also moves blanks to the left. You might want to conclude with this:

name(j) = left(name(j) ) ;

novinosrin · Posted 04-16-2019 03:38 PM

excellent catch. spot on! noted

@Astounding I think I got lucky there because CAT group of functions removes leading and trailing blanks, nonetheless your point is very important for all.

novinosrin · Posted 04-16-2019 03:53 PM



data have;
input (NameSource1        NameSource2) (:$20.);
cards;
JOHNSMITH          SMITHJOHN
;

data want;
set have;
array t(999) $1;
array name(*) NameSource1 NameSource2;
do j=1 to dim(name);
call missing(of t(*));
call pokelong(name(j),addrlong(t(1)),vlength(name(j)));
call sortc(of t(*));
name(j)=cats(of t(*));
end;
drop t: j;
run;

ballardw · Posted 04-16-2019 03:13 PM

What did you try?

Do your data sets have the same numbers of records? How you combine the two is going to have some impact.

Do you have differences in case? Sort order does treat capital and lower case letters as different orders.

How many records are we talking about in each of these sets?

accesnh · Posted 04-16-2019 03:16 PM

There is the same amount of records in each data set. Also, I used UPPER to make everything the same case as well as compressed the data to remove any special characters. Right now, each data set contains ~20,000 records.

I've never had to do something like this so I've been searching various online resources all day, but have yet to find a solution.

ballardw · Posted 04-16-2019 06:34 PM

@accesnh wrote:

There is the same amount of records in each data set. Also, I used UPPER to make everything the same case as well as compressed the data to remove any special characters. Right now, each data set contains ~20,000 records.

I've never had to do something like this so I've been searching various online resources all day, but have yet to find a solution.

I will not make any claim to efficiency. The following compares all the names in one set to the other set. So don't bother sorting before hand. The options 'IL' in the compged tells SAS to ignore case and removes blanks before the comparison.

The order provides the data in ascending values of the comparison, meaning less likely matches.

The data step selects the first as the likely closest match.

data set1;
   input name1 $;

datalines;
john
sally
fred
mary
;
run;

data set2;
   input name2 $;

datalines;
fred
johny
sal
marianne
maria 
marcy
;
run;

proc sql;
   create table temp as
   select a.name1, b.name2, compged(a.name1,b.name,'IL') as comp
   from set1 as a, set2 as b
   order by a.name1, comp
   ;
quit;

data want;
   set temp;
   by name1;
   if first.name1;
run;

FreelanceReinh · Posted 04-16-2019 06:16 PM

Note that there is a risk of making different names similar or even identical by sorting their letters: See examples in the discussion of a similar question last November (https://communities.sas.com/t5/SAS-Programming/deleting-mixed-duplicates/m-p/510648). Sorting the names (like "Smith John" --> "John Smith") might be safer.

Reeza · Posted 04-16-2019 06:49 PM

The trick to something like this is to do the lookups conditionally - assuming it's a look up of sorts.

First do an exact match comparison. If the matches are made, remove the records and then proceed with fuzzy matching attempts. At each stage, pull out matched items so you're reducing the # of records that need to be processed each time.

@accesnh wrote:

I'm attempting to use COMPGED to compare names in two different data sets. However, there's a large portion of the data that has the names in reverse order. In order for the COMPGED to be more accurate, I'd like to sort the names in alphabetical order, but i'm having no luck doing so.

Have:

NameSource1 NameSource2
JOHNSMITH SMITHJOHN

Want:

Sort(NameSource1) Sort(NameSource2)

HHIJMNOST HHIJMNOST

Custom Sorting

Re: Custom Sorting

Re: Custom Sorting

Re: Custom Sorting

Re: Custom Sorting

Re: Custom Sorting

Re: Custom Sorting

Re: Custom Sorting

Re: Custom Sorting

Re: Custom Sorting

The 2025 SAS Hackathon has begun!