BookmarkSubscribeRSS Feed
NN
Quartz | Level 8 NN
Quartz | Level 8

    Hi,

I was trying to use the Compged function to match 3000 names against a 9million base... The base does have an index on the name...

Does any one have and suggestions on how i could improve the performance because as of now the query takes up a lot of time...

Or if someone could guide to some better name matching logics..

Thanks..

8 REPLIES 8
art297
Opal | Level 21

It would help to see your current code and small samples of the two files.

NN
Quartz | Level 8 NN
Quartz | Level 8

Hi Art..

Its quite simple i have a small table of 3000 names and a large table of close to 10 mn records with name : $200   and UID :  20.

an Index has been created on the Name field.

my query is like

proc sql;

select a.* , b.*

from small_table a,

big_table b

where compged(a,b) <=100;

quit;

Will this info do?

TomKari
Onyx | Level 15

I'm pretty sure that with this syntax every record from small_table is being matched against every record in big_table (this is called a cartesian join), in other words you're executing 30 billion comparisons, which is absolutely brutal.

This is definitely a case of needing a different approach...I haven't done this before, but I'm sure that someone will have an excellent suggestion soon.

Tom

PGStats
Opal | Level 21

You could solve the case where compged() = 0 first to reduce the size of the problem, along the line of :

proc sql;

create table exMatch as
select a.name as exName, b.*
from small_table as a inner join big_table as b
on a.name = b.name;

create table appMatch as
select a.name as appName , b.*
from small_table as a, big_table as b
where a.name not in (select name from exMatch) and
     compged(a.name, b.name, 110) <= 100;

select * from exMatch
union
select * from appMatch;

quit;

PG

PG
TomKari
Onyx | Level 15

Can you get decent results by using the SOUNDEX function on each of your strings, and then doing a direct match? That would greatly reduce the resources required.

Tom

mkeintz
PROC Star

Assume you first look for exact matches as was suggested and drop, say 1,000 from your original 3,000.

That's a big help but you will still be doing compged for the remaining 2,000 * 10M possible matches - that's 20B compged's.

And compged is expensive, so you want to avoid doing superfluous calculations (i.e. why do compged if one name has over 3 more letters than another?).  So perhaps you could calculate

    nletrs=length(compress(name,' ')) for names in both tables.

Then do compged only when   a.nletrs is between b.nletrs-3 and b.nletrs+3

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
FriedEgg
SAS Employee

data new;

input name $;

cards;

ALIX

MATT

ROB

CHUNJIE

;

run;

data base(index=(name soundex));

input name $;

soundex=soundex(name);

cards;

ROB

JAMES

JOHN

ROBERT

MICHAEL

WILLIAM

DAVID

RICHARD

CHARLES

JOSEPH

THOMAS

DANIEL

PAUL

MARK

DONALD

ROBBY

GEORGE

KENNETH

STEVEN

EDWARD

BRIAN

RONALD

ANTHONY

KEVIN

JASON

MATTHEW

GARY

TIMOTHY

JOSE

LARRY

JEFFREY

ALEX

;

run;

proc sql _method;

create table want as

select a.name as new_name

      , b.name as base_name

   , 'exact' as match_type

   from new a

      , base b

  where a.name=b.name

  union all

select a.name as new_name

      , b.name as base_name

   , 'phonetic' as match_type

   from new a

      , base b

  where a.name=*b.name

  union all

select a.name as new_name

      , b.name as base_name

   , 'soundex' as match_type

   from new a

      , base b

  where soundex(a.name)=b.soundex

  union all

select a.name as new_name

      , b.name as base_name

   , 'spelldex' as match_type

   from new a

      , base b

  where cats(substr(a.name,1,1),substr(compress(strip(a.name),'AEIOUY','p'),2))=cats(substr(b.name,1,1),substr(compress(strip(b.name),'AEIOUY','p'),2))

  union all

select a.name as new_name

      , b.name as base_name

   , 'compged' as match_type

   from new a

      , base b

  where compged(a.name,b.name)<=100;

quit;

ROBROBexact
ALIXALEXphonetic
MATTMATTHEWphonetic
ROBROBphonetic
ROBROBBYphonetic
ALIXALEXsoundex
MATTMATTHEWsoundex
ROBROBsoundex
ROBROBBYsoundex
ROBROBspelldex
ALIXALEXspelldex
ALIXALEXcompged
MATTMATTHEWcompged
ROBROBcompged
ROBROBERTcompged
ROBROBBYcompged
NN
Quartz | Level 8 NN
Quartz | Level 8

Thank you all..

Using soundex before compged sounds promising.. shall try out all solutions provided and use the best fit...

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 4856 views
  • 12 likes
  • 6 in conversation