Re: Name matching

NN · Posted 08-02-2012 02:34 PM

Hi,

I was trying to use the Compged function to match 3000 names against a 9million base... The base does have an index on the name...

Does any one have and suggestions on how i could improve the performance because as of now the query takes up a lot of time...

Or if someone could guide to some better name matching logics..

Thanks..

art297 · Posted 08-02-2012 02:38 PM

It would help to see your current code and small samples of the two files.

NN · Posted 08-02-2012 02:46 PM

Hi Art..

Its quite simple i have a small table of 3000 names and a large table of close to 10 mn records with name : $200 and UID : 20.

an Index has been created on the Name field.

my query is like

proc sql;

select a.* , b.*

from small_table a,

big_table b

where compged(a,b) <=100;

quit;

Will this info do?

TomKari · Posted 08-02-2012 03:02 PM

I'm pretty sure that with this syntax every record from small_table is being matched against every record in big_table (this is called a cartesian join), in other words you're executing 30 billion comparisons, which is absolutely brutal.

This is definitely a case of needing a different approach...I haven't done this before, but I'm sure that someone will have an excellent suggestion soon.

Tom

PGStats · Posted 08-02-2012 03:52 PM

You could solve the case where compged() = 0 first to reduce the size of the problem, along the line of :

proc sql;

create table exMatch as
select a.name as exName, b.*
from small_table as a inner join big_table as b
on a.name = b.name;

create table appMatch as
select a.name as appName , b.*
from small_table as a, big_table as b
where a.name not in (select name from exMatch) and
compged(a.name, b.name, 110) <= 100;

select * from exMatch
union
select * from appMatch;

quit;

PG

TomKari · Posted 08-02-2012 04:02 PM

Can you get decent results by using the SOUNDEX function on each of your strings, and then doing a direct match? That would greatly reduce the resources required.

Tom

mkeintz · Posted 08-02-2012 04:23 PM

Assume you first look for exact matches as was suggested and drop, say 1,000 from your original 3,000.

That's a big help but you will still be doing compged for the remaining 2,000 * 10M possible matches - that's 20B compged's.

And compged is expensive, so you want to avoid doing superfluous calculations (i.e. why do compged if one name has over 3 more letters than another?). So perhaps you could calculate

nletrs=length(compress(name,' ')) for names in both tables.

Then do compged only when a.nletrs is between b.nletrs-3 and b.nletrs+3

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

FriedEgg · Posted 08-02-2012 06:05 PM

data new;

input name $;

cards;

ALIX

MATT

ROB

CHUNJIE

;

run;

data base(index=(name soundex));

input name $;

soundex=soundex(name);

cards;

ROB

JAMES

JOHN

ROBERT

MICHAEL

WILLIAM

DAVID

RICHARD

CHARLES

JOSEPH

THOMAS

DANIEL

PAUL

MARK

DONALD

ROBBY

GEORGE

KENNETH

STEVEN

EDWARD

BRIAN

RONALD

ANTHONY

KEVIN

JASON

MATTHEW

GARY

TIMOTHY

JOSE

LARRY

JEFFREY

ALEX

;

run;

proc sql _method;

create table want as

select a.name as new_name

, b.name as base_name

, 'exact' as match_type

from new a

, base b

where a.name=b.name

union all

select a.name as new_name

, b.name as base_name

, 'phonetic' as match_type

from new a

, base b

where a.name=*b.name

union all

select a.name as new_name

, b.name as base_name

, 'soundex' as match_type

from new a

, base b

where soundex(a.name)=b.soundex

union all

select a.name as new_name

, b.name as base_name

, 'spelldex' as match_type

from new a

, base b

where cats(substr(a.name,1,1),substr(compress(strip(a.name),'AEIOUY','p'),2))=cats(substr(b.name,1,1),substr(compress(strip(b.name),'AEIOUY','p'),2))

union all

select a.name as new_name

, b.name as base_name

, 'compged' as match_type

from new a

, base b

where compged(a.name,b.name)<=100;

quit;

ROB	ROB	exact
ALIX	ALEX	phonetic
MATT	MATTHEW	phonetic
ROB	ROB	phonetic
ROB	ROBBY	phonetic
ALIX	ALEX	soundex
MATT	MATTHEW	soundex
ROB	ROB	soundex
ROB	ROBBY	soundex
ROB	ROB	spelldex
ALIX	ALEX	spelldex
ALIX	ALEX	compged
MATT	MATTHEW	compged
ROB	ROB	compged
ROB	ROBERT	compged
ROB	ROBBY	compged

NN · Posted 08-07-2012 01:54 PM

Thank you all..

Using soundex before compged sounds promising.. shall try out all solutions provided and use the best fit...

Registration is open

SAS Training: Just a Click Away