SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Fuzzy match 2 datasets

Reply
Contributor
Posts: 26

Fuzzy match 2 datasets

I have the following script for fuzzy match. Dataset groupA has about 22,000 records and groupB has about 77,000 records. The program is still running after 30 minutes, so i wonder if that's because they have too many records or if there is something wrong with my script. 

 

data group_AandB;
set groupA;
tmp_carf_name=soundex(Company_Name);
tmp_carf_address=soundex(Address_1);
tmp_carf_city=soundex(City);
tmp_carf_state=soundex(State);
do i=1 to nobs;
set groupB(rename=(ADDRESS_1=ADDRESS CITY=CITY1)) point=i nobs=nobs;
tmp_pdr_name=soundex(ORG_NAME_ACTUAL);
tmp_pdr_address=soundex(ADDRESS);
tmp_pdr_city=soundex(CITY1);
tmp_pdr_state=soundex(STATE_CODE);

dif1=compged(tmp_carf_name, tmp_pdr_name);
dif2=compged(tmp_carf_address, tmp_pdr_address);
dif3=compged(tmp_carf_city, tmp_pdr_city);
dif4=compged(tmp_carf_state, tmp_pdr_state);

if dif1<=100 and dif2<=100 and dif3<=100 and dif4<=1 then do;
drop tmp_carf_name tmp_pdr_name tmp_carf_address tmp_pdr_address
tmp_carf_city tmp_pdr_city tmp_carf_state tmp_pdr_state
dif1 dif2 dif3;
output;
end;end;
run;

PROC Star
Posts: 1,288

Re: Fuzzy match 2 datasets

You're doing 1.7 billion combinations of records, given that this is disk-intensive I would absolutely expect it to take quite a while. I wouldn't say it's "too many" records, but it is a very intensive process. I suggest when you do something like this that you test with a few dozen records to get the logic right.

 

Tom

Super User
Posts: 6,629

Re: Fuzzy match 2 datasets

If your objective is to find some sort of "closest match", I would suggest that you do not output every combination.  Remember, if you output 1.7B observations, your next step has to process all of them.

 

Instead, remove the OUTPUT statement.  In its place, calculate the "distance" and whenever the distance is closer than the previous "best" distance, replace a set of variables so that those variables hold the best match found so far.

 

Without an OUTPUT statement, that best match will be part of the observation when the looping ends, and will be output automatically.

Contributor
Posts: 26

Re: Fuzzy match 2 datasets

Posted in reply to Astounding

Yes, I'm trying to get the closet match. Can you tell me how to tweet my code to compare the distances and the get best distance?

Super User
Posts: 6,629

Re: Fuzzy match 2 datasets

Not really, no.  That part is up to you, determining which is the closest.  However, I can help.  Once you have decided upon a formula to measure the distance, I can show you how to save just 22,000 observations (each with its closest match) instead of 1.7B observations.

Esteemed Advisor
Posts: 5,479

Re: Fuzzy match 2 datasets

  • soundex in groupB should be precalculated in a previous step instead of recalculated for every obs from groupA.Give them lengths identical to the original strings.
  • compged should specify a cutoff argument.
  • In your distance test, you could calculate the second compged (dif2) only if the first one (dif1) is less than 100, and so on for the other distances.
  • Another approach that could save a lot of comparisons would compare only cases where the first letters of all 4 fields match.
PG
Respected Advisor
Posts: 4,672

Re: Fuzzy match 2 datasets

@ernie86

If you've got the SAS data quality server license (or any license which gives you access to DataFlux and DF... SAS data step functions) then I would first create match codes for your address data and then run all additional logic only for records with the same match codes.

Ask a Question
Discussion stats
  • 6 replies
  • 182 views
  • 3 likes
  • 5 in conversation