Hello - I've checked the forum for an answer to my question but no luck so far. Thank you all in advance for any help you can provide and my apologies if a post exists already that I missed. I have a list of project names, and I want to search for similar project names in the same dataset. My current method involves using a macro and a CALL EXECUTE with the list of projects, and a proc append to keep only those records with close matches. I've put simplified code below and have a simplified HAVE table. I have BASE and STAT SAS. While my current method does work, it takes a long time to execute (100k unique projects being searched). Does anyone have any recommendations to create more efficient code? Current method takes HOURS to run. From the searches I've done, HASH tables might be an option, or some kind of indexing, but I'm not sure how to create that properly. Thank you for your help!
Simplified Current Code:
%MACRO MNAME (PROEJCT);
DATA TEST;
SET FULL_LIST;
FUZZY = COMPGED("&project.",PROJ_NAME);
IF FUZZY < 1000;
RUN;
%MEND;
DATA _NULL_;
SET full_list;
CALL EXECUTE ('%MNAME ('||TRIM(project_name||');');
RUN;
HAVE:
data have;
infile datalines dlm ='09'x;
input company $ proj_name $25.;
datalines;
A CHOC CHIP RECIPE
B CHOCOLATE CHIP RECIPE
C SUGAR
D OATMEAL RAISIN
;
RUN;
WANT:
Lines 1 / 2 are similar
C - no project matches
D - no project matches
Unfortunately I do not have DQ Studio. Yes, this would be an insane amount of comparisons. Before posting I added a bit of code that would remove records I've already searched from the population, but that removes 1 record from my search list at a time. Not the most efficient either.
Would the SQL version run any faster?:
data have;
input company $ proj_name $25.;
datalines;
A CHOC_CHIP_RECIPE
B CHOCOLATE_CHIP_RECIPE
C SUGAR
D OATMEAL_RAISIN
;
RUN;
proc sql;
select a.proj_name as list1, b.proj_name as list2, compged(a.proj_name, b.proj_name) as fuzzy_match
from have a, have b
where a.proj_name <> b.proj_name and compged(a.proj_name, b.proj_name) <600;
run;
Output is
list1 | list2 | fuzzy_match |
CHOC_CHIP_RECIPE | CHOCOLATE_CHIP_RECIPE | 500 |
CHOCOLATE_CHIP_RECIPE | CHOC_CHIP_RECIPE | 500 |
I might suggest describing "similar" in some more detail. Since your example happens to list 4 types of cookies, I might say that they are all similar.
Without a more concrete example, I might look to see if there common contractions or spelling differences and try standardizing those.
I used to deal with some of our local department of education data and had to match schools from year to year. Amazingly even when the name didn't change the spelling in the records available would. "HS" "High School" "High" "Sr High" just for a short list. Making a new new name to replace all of those with standard spelling such as "HS" improved the match rates.
Order of changes can matter since some of the data would have "Junior High" so replacing "High" alone would be wrong.
With standard spellings in both you might improve the "exact" match rate as @Reeza suggested and then have much smaller fuzzy match pool to play with.
You might also look into COMPLEV as it supposedly executes quicker than COMPGED
I want to thank those who have posted so far. It's been a helpful discussion. I'm going to look more into breaking the population down more to reduce the size of the search and see how that can help.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.