Re: Fuzzy match 2 datasets

ernie86 · Posted 05-30-2018 06:01 PM

I have the following script for fuzzy match. Dataset groupA has about 22,000 records and groupB has about 77,000 records. The program is still running after 30 minutes, so i wonder if that's because they have too many records or if there is something wrong with my script.

data group_AandB;
set groupA;
tmp_carf_name=soundex(Company_Name);
tmp_carf_address=soundex(Address_1);
tmp_carf_city=soundex(City);
tmp_carf_state=soundex(State);
do i=1 to nobs;
set groupB(rename=(ADDRESS_1=ADDRESS CITY=CITY1)) point=i nobs=nobs;
tmp_pdr_name=soundex(ORG_NAME_ACTUAL);
tmp_pdr_address=soundex(ADDRESS);
tmp_pdr_city=soundex(CITY1);
tmp_pdr_state=soundex(STATE_CODE);

dif1=compged(tmp_carf_name, tmp_pdr_name);
dif2=compged(tmp_carf_address, tmp_pdr_address);
dif3=compged(tmp_carf_city, tmp_pdr_city);
dif4=compged(tmp_carf_state, tmp_pdr_state);

if dif1<=100 and dif2<=100 and dif3<=100 and dif4<=1 then do;
drop tmp_carf_name tmp_pdr_name tmp_carf_address tmp_pdr_address
tmp_carf_city tmp_pdr_city tmp_carf_state tmp_pdr_state
dif1 dif2 dif3;
output;
end;end;
run;

TomKari · Posted 05-30-2018 06:24 PM

You're doing 1.7 billion combinations of records, given that this is disk-intensive I would absolutely expect it to take quite a while. I wouldn't say it's "too many" records, but it is a very intensive process. I suggest when you do something like this that you test with a few dozen records to get the logic right.

Tom

Astounding · Posted 05-30-2018 09:19 PM

If your objective is to find some sort of "closest match", I would suggest that you do not output every combination. Remember, if you output 1.7B observations, your next step has to process all of them.

Instead, remove the OUTPUT statement. In its place, calculate the "distance" and whenever the distance is closer than the previous "best" distance, replace a set of variables so that those variables hold the best match found so far.

Without an OUTPUT statement, that best match will be part of the observation when the looping ends, and will be output automatically.

ernie86 · Posted 05-31-2018 11:13 AM

Yes, I'm trying to get the closet match. Can you tell me how to tweet my code to compare the distances and the get best distance?

Astounding · Posted 05-31-2018 12:56 PM

Not really, no. That part is up to you, determining which is the closest. However, I can help. Once you have decided upon a formula to measure the distance, I can show you how to save just 22,000 observations (each with its closest match) instead of 1.7B observations.

PGStats · Posted 05-31-2018 01:01 AM

soundex in groupB should be precalculated in a previous step instead of recalculated for every obs from groupA.Give them lengths identical to the original strings.
compged should specify a cutoff argument.
In your distance test, you could calculate the second compged (dif2) only if the first one (dif1) is less than 100, and so on for the other distances.
Another approach that could save a lot of comparisons would compare only cases where the first letters of all 4 fields match.

PG

Patrick · Posted 05-31-2018 02:26 AM

@ernie86

If you've got the SAS data quality server license (or any license which gives you access to DataFlux and DF... SAS data step functions) then I would first create match codes for your address data and then run all additional logic only for records with the same match codes.