BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
adornodj
Obsidian | Level 7

Hello - I've checked the forum for an answer to my question but no luck so far.  Thank you all in advance for any help you can provide and my apologies if a post exists already that I missed.  I have a list of project names, and I want to search for similar project names in the same dataset.  My current method involves using a macro and a CALL EXECUTE with the list of projects, and a proc append to keep only those records with close matches.  I've put simplified code below and have a simplified HAVE table.  I have BASE and STAT SAS.  While my current method does work, it takes a long time to execute (100k unique projects being searched).  Does anyone have any recommendations to create more efficient code?  Current method takes HOURS to run.  From the searches I've done, HASH tables might be an option, or some kind of indexing, but I'm not sure how to create that properly.  Thank you for your help!

 

Simplified Current Code:

%MACRO MNAME (PROEJCT);

DATA TEST;
SET FULL_LIST;
FUZZY = COMPGED("&project.",PROJ_NAME);
IF FUZZY < 1000;
RUN;

%MEND;

DATA _NULL_;
SET full_list;
CALL EXECUTE ('%MNAME ('||TRIM(project_name||');');
RUN;

 

 

HAVE:

data have;
infile datalines dlm ='09'x;
input company $ proj_name $25.;
datalines;
A CHOC CHIP RECIPE
B CHOCOLATE CHIP RECIPE
C SUGAR
D OATMEAL RAISIN
;
RUN;

 

WANT: 

Lines 1 / 2 are similar

C - no project matches

D - no project matches

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User
When you do fuzzy matching you compare every observation against every other observation.
That's 100,000*100,000 comparisons or 1*10^10 or about 10 billion if my math is right (it may not be). Point being - fuzzy matching is time intensive.

Usually the recommendation is to first do an exact match and remove those records to simplify the analysis.
Then go into fuzzy matching.

FYI - do you have SAS DQ studio? It does a good job at this type of problems.

View solution in original post

7 REPLIES 7
Reeza
Super User
When you do fuzzy matching you compare every observation against every other observation.
That's 100,000*100,000 comparisons or 1*10^10 or about 10 billion if my math is right (it may not be). Point being - fuzzy matching is time intensive.

Usually the recommendation is to first do an exact match and remove those records to simplify the analysis.
Then go into fuzzy matching.

FYI - do you have SAS DQ studio? It does a good job at this type of problems.
adornodj
Obsidian | Level 7

Unfortunately I do not have DQ Studio.  Yes, this would be an insane amount of comparisons.  Before posting I added a bit of code that would remove records I've already searched from the population, but that removes 1 record from my search list at a time.  Not the most efficient either.  

HB
Barite | Level 11 HB
Barite | Level 11

Would the SQL version run any faster?:

 

data have;
input company $ proj_name $25.;
datalines;
A CHOC_CHIP_RECIPE
B CHOCOLATE_CHIP_RECIPE
C SUGAR
D OATMEAL_RAISIN
;
RUN;

proc sql;
	select a.proj_name as list1, b.proj_name as list2, compged(a.proj_name, b.proj_name) as fuzzy_match
	from have a, have b
	where a.proj_name <> b.proj_name and compged(a.proj_name, b.proj_name) <600;
run;

Output is 

 

list1 list2 fuzzy_match
CHOC_CHIP_RECIPE CHOCOLATE_CHIP_RECIPE 500
CHOCOLATE_CHIP_RECIPE CHOC_CHIP_RECIPE 500
Reeza
Super User
SQL will likely run out of memory with 100,000 rows and a cross join.
ballardw
Super User

I might suggest describing "similar" in some more detail. Since your example happens to list 4 types of cookies, I might say that they are all similar.

 

Without a more concrete example, I might look to see if there common contractions or spelling differences and try standardizing those.

I used to deal with some of our local department of education data and had to match schools from year to year. Amazingly even when the name didn't change the spelling in the records available would. "HS" "High School" "High" "Sr High" just for a short list. Making a new new name to replace all of those with standard spelling such as "HS" improved the match rates.

Order of changes can matter since some of the data would have "Junior High" so replacing "High" alone would be wrong.

 

With standard spellings in both you might improve the "exact" match rate as @Reeza suggested and then have much smaller fuzzy match pool to play with.

 

You might also look into COMPLEV as it supposedly executes quicker than COMPGED

adornodj
Obsidian | Level 7

I want to thank those who have posted so far.  It's been a helpful discussion.  I'm going to look more into breaking the population down more to reduce the size of the search and see how that can help.  

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 813 views
  • 2 likes
  • 4 in conversation