I have the following script for fuzzy match. Dataset groupA has about 22,000 records and groupB has about 77,000 records. The program is still running after 30 minutes, so i wonder if that's because they have too many records or if there is something wrong with my script.
data group_AandB;
set groupA;
tmp_carf_name=soundex(Company_Name);
tmp_carf_address=soundex(Address_1);
tmp_carf_city=soundex(City);
tmp_carf_state=soundex(State);
do i=1 to nobs;
set groupB(rename=(ADDRESS_1=ADDRESS CITY=CITY1)) point=i nobs=nobs;
tmp_pdr_name=soundex(ORG_NAME_ACTUAL);
tmp_pdr_address=soundex(ADDRESS);
tmp_pdr_city=soundex(CITY1);
tmp_pdr_state=soundex(STATE_CODE);
dif1=compged(tmp_carf_name, tmp_pdr_name);
dif2=compged(tmp_carf_address, tmp_pdr_address);
dif3=compged(tmp_carf_city, tmp_pdr_city);
dif4=compged(tmp_carf_state, tmp_pdr_state);
if dif1<=100 and dif2<=100 and dif3<=100 and dif4<=1 then do;
drop tmp_carf_name tmp_pdr_name tmp_carf_address tmp_pdr_address
tmp_carf_city tmp_pdr_city tmp_carf_state tmp_pdr_state
dif1 dif2 dif3;
output;
end;end;
run;
You're doing 1.7 billion combinations of records, given that this is disk-intensive I would absolutely expect it to take quite a while. I wouldn't say it's "too many" records, but it is a very intensive process. I suggest when you do something like this that you test with a few dozen records to get the logic right.
Tom
If your objective is to find some sort of "closest match", I would suggest that you do not output every combination. Remember, if you output 1.7B observations, your next step has to process all of them.
Instead, remove the OUTPUT statement. In its place, calculate the "distance" and whenever the distance is closer than the previous "best" distance, replace a set of variables so that those variables hold the best match found so far.
Without an OUTPUT statement, that best match will be part of the observation when the looping ends, and will be output automatically.
Yes, I'm trying to get the closet match. Can you tell me how to tweet my code to compare the distances and the get best distance?
Not really, no. That part is up to you, determining which is the closest. However, I can help. Once you have decided upon a formula to measure the distance, I can show you how to save just 22,000 observations (each with its closest match) instead of 1.7B observations.
If you've got the SAS data quality server license (or any license which gives you access to DataFlux and DF... SAS data step functions) then I would first create match codes for your address data and then run all additional logic only for records with the same match codes.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.