topic Re: Fuzzy match 2 datasets in SAS Data Management

Fuzzy match 2 datasets

ernie86 — Wed, 30 May 2018 22:01:19 GMT

I have the following script for fuzzy match. Dataset groupA has about 22,000 records and groupB has about 77,000 records. The program is still running after 30 minutes, so i wonder if that's because they have too many records or if there is something wrong with my script.

data group_AandB;
set groupA;
tmp_carf_name=soundex(Company_Name);
tmp_carf_address=soundex(Address_1);
tmp_carf_city=soundex(City);
tmp_carf_state=soundex(State);
do i=1 to nobs;
set groupB(rename=(ADDRESS_1=ADDRESS CITY=CITY1)) point=i nobs=nobs;
tmp_pdr_name=soundex(ORG_NAME_ACTUAL);
tmp_pdr_address=soundex(ADDRESS);
tmp_pdr_city=soundex(CITY1);
tmp_pdr_state=soundex(STATE_CODE);

dif1=compged(tmp_carf_name, tmp_pdr_name);
dif2=compged(tmp_carf_address, tmp_pdr_address);
dif3=compged(tmp_carf_city, tmp_pdr_city);
dif4=compged(tmp_carf_state, tmp_pdr_state);

if dif1<=100 and dif2<=100 and dif3<=100 and dif4<=1 then do;
drop tmp_carf_name tmp_pdr_name tmp_carf_address tmp_pdr_address
tmp_carf_city tmp_pdr_city tmp_carf_state tmp_pdr_state
dif1 dif2 dif3;
output;
end;end;
run;

Re: Fuzzy match 2 datasets

TomKari — Wed, 30 May 2018 22:24:17 GMT

You're doing 1.7 billion combinations of records, given that this is disk-intensive I would absolutely expect it to take quite a while. I wouldn't say it's "too many" records, but it is a very intensive process. I suggest when you do something like this that you test with a few dozen records to get the logic right.

Tom

Re: Fuzzy match 2 datasets

Astounding — Thu, 31 May 2018 01:19:08 GMT

If your objective is to find some sort of "closest match", I would suggest that you do not output every combination. Remember, if you output 1.7B observations, your next step has to process all of them.

Instead, remove the OUTPUT statement. In its place, calculate the "distance" and whenever the distance is closer than the previous "best" distance, replace a set of variables so that those variables hold the best match found so far.

Without an OUTPUT statement, that best match will be part of the observation when the looping ends, and will be output automatically.

Re: Fuzzy match 2 datasets

PGStats — Thu, 31 May 2018 05:01:03 GMT

soundex in groupB should be precalculated in a previous step instead of recalculated for every obs from groupA.Give them lengths identical to the original strings.
compged should specify a cutoff argument.
In your distance test, you could calculate the second compged (dif2) only if the first one (dif1) is less than 100, and so on for the other distances.
Another approach that could save a lot of comparisons would compare only cases where the first letters of all 4 fields match.

Re: Fuzzy match 2 datasets

Patrick — Thu, 31 May 2018 06:26:41 GMT

@ernie86

If you've got the SAS data quality server license (or any license which gives you access to DataFlux and DF... SAS data step functions) then I would first create match codes for your address data and then run all additional logic only for records with the same match codes.

Re: Fuzzy match 2 datasets

ernie86 — Thu, 31 May 2018 15:13:26 GMT

Yes, I'm trying to get the closet match. Can you tell me how to tweet my code to compare the distances and the get best distance?

Re: Fuzzy match 2 datasets

Astounding — Thu, 31 May 2018 16:56:44 GMT

Not really, no. That part is up to you, determining which is the closest. However, I can help. Once you have decided upon a formula to measure the distance, I can show you how to save just 22,000 observations (each with its closest match) instead of 1.7B observations.