Hi,
I was trying to use the Compged function to match 3000 names against a 9million base... The base does have an index on the name...
Does any one have and suggestions on how i could improve the performance because as of now the query takes up a lot of time...
Or if someone could guide to some better name matching logics..
Thanks..
It would help to see your current code and small samples of the two files.
Hi Art..
Its quite simple i have a small table of 3000 names and a large table of close to 10 mn records with name : $200 and UID : 20.
an Index has been created on the Name field.
my query is like
proc sql;
select a.* , b.*
from small_table a,
big_table b
where compged(a,b) <=100;
quit;
Will this info do?
I'm pretty sure that with this syntax every record from small_table is being matched against every record in big_table (this is called a cartesian join), in other words you're executing 30 billion comparisons, which is absolutely brutal.
This is definitely a case of needing a different approach...I haven't done this before, but I'm sure that someone will have an excellent suggestion soon.
Tom
You could solve the case where compged() = 0 first to reduce the size of the problem, along the line of :
proc sql;
create table exMatch as
select a.name as exName, b.*
from small_table as a inner join big_table as b
on a.name = b.name;
create table appMatch as
select a.name as appName , b.*
from small_table as a, big_table as b
where a.name not in (select name from exMatch) and
compged(a.name, b.name, 110) <= 100;
select * from exMatch
union
select * from appMatch;
quit;
PG
Can you get decent results by using the SOUNDEX function on each of your strings, and then doing a direct match? That would greatly reduce the resources required.
Tom
Assume you first look for exact matches as was suggested and drop, say 1,000 from your original 3,000.
That's a big help but you will still be doing compged for the remaining 2,000 * 10M possible matches - that's 20B compged's.
And compged is expensive, so you want to avoid doing superfluous calculations (i.e. why do compged if one name has over 3 more letters than another?). So perhaps you could calculate
nletrs=length(compress(name,' ')) for names in both tables.
Then do compged only when a.nletrs is between b.nletrs-3 and b.nletrs+3
data new;
input name $;
cards;
ALIX
MATT
ROB
CHUNJIE
;
run;
data base(index=(name soundex));
input name $;
soundex=soundex(name);
cards;
ROB
JAMES
JOHN
ROBERT
MICHAEL
WILLIAM
DAVID
RICHARD
CHARLES
JOSEPH
THOMAS
DANIEL
PAUL
MARK
DONALD
ROBBY
GEORGE
KENNETH
STEVEN
EDWARD
BRIAN
RONALD
ANTHONY
KEVIN
JASON
MATTHEW
GARY
TIMOTHY
JOSE
LARRY
JEFFREY
ALEX
;
run;
proc sql _method;
create table want as
select a.name as new_name
, b.name as base_name
, 'exact' as match_type
from new a
, base b
where a.name=b.name
union all
select a.name as new_name
, b.name as base_name
, 'phonetic' as match_type
from new a
, base b
where a.name=*b.name
union all
select a.name as new_name
, b.name as base_name
, 'soundex' as match_type
from new a
, base b
where soundex(a.name)=b.soundex
union all
select a.name as new_name
, b.name as base_name
, 'spelldex' as match_type
from new a
, base b
where cats(substr(a.name,1,1),substr(compress(strip(a.name),'AEIOUY','p'),2))=cats(substr(b.name,1,1),substr(compress(strip(b.name),'AEIOUY','p'),2))
union all
select a.name as new_name
, b.name as base_name
, 'compged' as match_type
from new a
, base b
where compged(a.name,b.name)<=100;
quit;
ROB | ROB | exact |
ALIX | ALEX | phonetic |
MATT | MATTHEW | phonetic |
ROB | ROB | phonetic |
ROB | ROBBY | phonetic |
ALIX | ALEX | soundex |
MATT | MATTHEW | soundex |
ROB | ROB | soundex |
ROB | ROBBY | soundex |
ROB | ROB | spelldex |
ALIX | ALEX | spelldex |
ALIX | ALEX | compged |
MATT | MATTHEW | compged |
ROB | ROB | compged |
ROB | ROBERT | compged |
ROB | ROBBY | compged |
Thank you all..
Using soundex before compged sounds promising.. shall try out all solutions provided and use the best fit...
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.