Help using Base SAS procedures

Name matching

Reply
Regular Contributor
Regular Contributor
Posts: 166

Name matching

    Hi,

I was trying to use the Compged function to match 3000 names against a 9million base... The base does have an index on the name...

Does any one have and suggestions on how i could improve the performance because as of now the query takes up a lot of time...

Or if someone could guide to some better name matching logics..

Thanks..

PROC Star
Posts: 7,474

Re: Name matching

It would help to see your current code and small samples of the two files.

Regular Contributor
Regular Contributor
Posts: 166

Re: Name matching

Hi Art..

Its quite simple i have a small table of 3000 names and a large table of close to 10 mn records with name : $200   and UID :  20.

an Index has been created on the Name field.

my query is like

proc sql;

select a.* , b.*

from small_table a,

big_table b

where compged(a,b) <=100;

quit;

Will this info do?

PROC Star
Posts: 1,167

Re: Name matching

I'm pretty sure that with this syntax every record from small_table is being matched against every record in big_table (this is called a cartesian join), in other words you're executing 30 billion comparisons, which is absolutely brutal.

This is definitely a case of needing a different approach...I haven't done this before, but I'm sure that someone will have an excellent suggestion soon.

Tom

Respected Advisor
Posts: 4,925

Re: Name matching

You could solve the case where compged() = 0 first to reduce the size of the problem, along the line of :

proc sql;

create table exMatch as
select a.name as exName, b.*
from small_table as a inner join big_table as b
on a.name = b.name;

create table appMatch as
select a.name as appName , b.*
from small_table as a, big_table as b
where a.name not in (select name from exMatch) and
     compged(a.name, b.name, 110) <= 100;

select * from exMatch
union
select * from appMatch;

quit;

PG

PG
PROC Star
Posts: 1,167

Re: Name matching

Can you get decent results by using the SOUNDEX function on each of your strings, and then doing a direct match? That would greatly reduce the resources required.

Tom

Trusted Advisor
Posts: 1,019

Re: Name matching

Assume you first look for exact matches as was suggested and drop, say 1,000 from your original 3,000.

That's a big help but you will still be doing compged for the remaining 2,000 * 10M possible matches - that's 20B compged's.

And compged is expensive, so you want to avoid doing superfluous calculations (i.e. why do compged if one name has over 3 more letters than another?).  So perhaps you could calculate

    nletrs=length(compress(name,' ')) for names in both tables.

Then do compged only when   a.nletrs is between b.nletrs-3 and b.nletrs+3

Trusted Advisor
Posts: 1,301

Re: Name matching

data new;

input name $;

cards;

ALIX

MATT

ROB

CHUNJIE

;

run;

data base(index=(name soundex));

input name $;

soundex=soundex(name);

cards;

ROB

JAMES

JOHN

ROBERT

MICHAEL

WILLIAM

DAVID

RICHARD

CHARLES

JOSEPH

THOMAS

DANIEL

PAUL

MARK

DONALD

ROBBY

GEORGE

KENNETH

STEVEN

EDWARD

BRIAN

RONALD

ANTHONY

KEVIN

JASON

MATTHEW

GARY

TIMOTHY

JOSE

LARRY

JEFFREY

ALEX

;

run;

proc sql _method;

create table want as

select a.name as new_name

      , b.name as base_name

   , 'exact' as match_type

   from new a

      , base b

  where a.name=b.name

  union all

select a.name as new_name

      , b.name as base_name

   , 'phonetic' as match_type

   from new a

      , base b

  where a.name=*b.name

  union all

select a.name as new_name

      , b.name as base_name

   , 'soundex' as match_type

   from new a

      , base b

  where soundex(a.name)=b.soundex

  union all

select a.name as new_name

      , b.name as base_name

   , 'spelldex' as match_type

   from new a

      , base b

  where cats(substr(a.name,1,1),substr(compress(strip(a.name),'AEIOUY','p'),2))=cats(substr(b.name,1,1),substr(compress(strip(b.name),'AEIOUY','p'),2))

  union all

select a.name as new_name

      , b.name as base_name

   , 'compged' as match_type

   from new a

      , base b

  where compged(a.name,b.name)<=100;

quit;

ROBROBexact
ALIXALEXphonetic
MATTMATTHEWphonetic
ROBROBphonetic
ROBROBBYphonetic
ALIXALEXsoundex
MATTMATTHEWsoundex
ROBROBsoundex
ROBROBBYsoundex
ROBROBspelldex
ALIXALEXspelldex
ALIXALEXcompged
MATTMATTHEWcompged
ROBROBcompged
ROBROBERTcompged
ROBROBBYcompged
Regular Contributor
Regular Contributor
Posts: 166

Re: Name matching

Thank you all..

Using soundex before compged sounds promising.. shall try out all solutions provided and use the best fit...

Ask a Question
Discussion stats
  • 8 replies
  • 1285 views
  • 10 likes
  • 6 in conversation