Solved: Re: Search Project List for Similar Names (Fuzzy match and Search)

adornodj · Posted 03-01-2022 03:31 PM

Hello - I've checked the forum for an answer to my question but no luck so far. Thank you all in advance for any help you can provide and my apologies if a post exists already that I missed. I have a list of project names, and I want to search for similar project names in the same dataset. My current method involves using a macro and a CALL EXECUTE with the list of projects, and a proc append to keep only those records with close matches. I've put simplified code below and have a simplified HAVE table. I have BASE and STAT SAS. While my current method does work, it takes a long time to execute (100k unique projects being searched). Does anyone have any recommendations to create more efficient code? Current method takes HOURS to run. From the searches I've done, HASH tables might be an option, or some kind of indexing, but I'm not sure how to create that properly. Thank you for your help!

Simplified Current Code:

%MACRO MNAME (PROEJCT);

DATA TEST;
SET FULL_LIST;
FUZZY = COMPGED("&project.",PROJ_NAME);
IF FUZZY < 1000;
RUN;

%MEND;

DATA _NULL_;
SET full_list;
CALL EXECUTE ('%MNAME ('||TRIM(project_name||');');
RUN;

HAVE:

data have;
infile datalines dlm ='09'x;
input company $ proj_name $25.;
datalines;
A CHOC CHIP RECIPE
B CHOCOLATE CHIP RECIPE
C SUGAR
D OATMEAL RAISIN
;
RUN;

WANT:

Lines 1 / 2 are similar

C - no project matches

D - no project matches

Reeza · Posted 03-01-2022 03:48 PM

When you do fuzzy matching you compare every observation against every other observation.
That's 100,000*100,000 comparisons or 1*10^10 or about 10 billion if my math is right (it may not be). Point being - fuzzy matching is time intensive.

Usually the recommendation is to first do an exact match and remove those records to simplify the analysis.
Then go into fuzzy matching.

FYI - do you have SAS DQ studio? It does a good job at this type of problems.

View solution in original post

Reeza · Posted 03-01-2022 03:48 PM

When you do fuzzy matching you compare every observation against every other observation.
That's 100,000*100,000 comparisons or 1*10^10 or about 10 billion if my math is right (it may not be). Point being - fuzzy matching is time intensive.

Usually the recommendation is to first do an exact match and remove those records to simplify the analysis.
Then go into fuzzy matching.

FYI - do you have SAS DQ studio? It does a good job at this type of problems.

adornodj · Posted 03-01-2022 04:00 PM

Unfortunately I do not have DQ Studio. Yes, this would be an insane amount of comparisons. Before posting I added a bit of code that would remove records I've already searched from the population, but that removes 1 record from my search list at a time. Not the most efficient either.

HB · Posted 03-01-2022 04:53 PM

Would the SQL version run any faster?:

data have;
input company $ proj_name $25.;
datalines;
A CHOC_CHIP_RECIPE
B CHOCOLATE_CHIP_RECIPE
C SUGAR
D OATMEAL_RAISIN
;
RUN;

proc sql;
	select a.proj_name as list1, b.proj_name as list2, compged(a.proj_name, b.proj_name) as fuzzy_match
	from have a, have b
	where a.proj_name <> b.proj_name and compged(a.proj_name, b.proj_name) <600;
run;

Output is

list1	list2	fuzzy_match
CHOC_CHIP_RECIPE	CHOCOLATE_CHIP_RECIPE	500
CHOCOLATE_CHIP_RECIPE	CHOC_CHIP_RECIPE	500

Reeza · Posted 03-01-2022 04:55 PM

SQL will likely run out of memory with 100,000 rows and a cross join.

HB · Posted 03-01-2022 05:35 PM

@Reeza Probably.

ballardw · Posted 03-01-2022 06:12 PM

I might suggest describing "similar" in some more detail. Since your example happens to list 4 types of cookies, I might say that they are all similar.

Without a more concrete example, I might look to see if there common contractions or spelling differences and try standardizing those.

I used to deal with some of our local department of education data and had to match schools from year to year. Amazingly even when the name didn't change the spelling in the records available would. "HS" "High School" "High" "Sr High" just for a short list. Making a new new name to replace all of those with standard spelling such as "HS" improved the match rates.

Order of changes can matter since some of the data would have "Junior High" so replacing "High" alone would be wrong.

With standard spellings in both you might improve the "exact" match rate as @Reeza suggested and then have much smaller fuzzy match pool to play with.

You might also look into COMPLEV as it supposedly executes quicker than COMPGED

adornodj · Posted 03-01-2022 10:12 PM

I want to thank those who have posted so far. It's been a helpful discussion. I'm going to look more into breaking the population down more to reduce the size of the search and see how that can help.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!