BookmarkSubscribeRSS Feed
ernie86
Calcite | Level 5

I have the following script for fuzzy match. Dataset groupA has about 22,000 records and groupB has about 77,000 records. The program is still running after 30 minutes, so i wonder if that's because they have too many records or if there is something wrong with my script. 

 

data group_AandB;
set groupA;
tmp_carf_name=soundex(Company_Name);
tmp_carf_address=soundex(Address_1);
tmp_carf_city=soundex(City);
tmp_carf_state=soundex(State);
do i=1 to nobs;
set groupB(rename=(ADDRESS_1=ADDRESS CITY=CITY1)) point=i nobs=nobs;
tmp_pdr_name=soundex(ORG_NAME_ACTUAL);
tmp_pdr_address=soundex(ADDRESS);
tmp_pdr_city=soundex(CITY1);
tmp_pdr_state=soundex(STATE_CODE);

dif1=compged(tmp_carf_name, tmp_pdr_name);
dif2=compged(tmp_carf_address, tmp_pdr_address);
dif3=compged(tmp_carf_city, tmp_pdr_city);
dif4=compged(tmp_carf_state, tmp_pdr_state);

if dif1<=100 and dif2<=100 and dif3<=100 and dif4<=1 then do;
drop tmp_carf_name tmp_pdr_name tmp_carf_address tmp_pdr_address
tmp_carf_city tmp_pdr_city tmp_carf_state tmp_pdr_state
dif1 dif2 dif3;
output;
end;end;
run;

6 REPLIES 6
TomKari
Onyx | Level 15

You're doing 1.7 billion combinations of records, given that this is disk-intensive I would absolutely expect it to take quite a while. I wouldn't say it's "too many" records, but it is a very intensive process. I suggest when you do something like this that you test with a few dozen records to get the logic right.

 

Tom

Astounding
PROC Star

If your objective is to find some sort of "closest match", I would suggest that you do not output every combination.  Remember, if you output 1.7B observations, your next step has to process all of them.

 

Instead, remove the OUTPUT statement.  In its place, calculate the "distance" and whenever the distance is closer than the previous "best" distance, replace a set of variables so that those variables hold the best match found so far.

 

Without an OUTPUT statement, that best match will be part of the observation when the looping ends, and will be output automatically.

ernie86
Calcite | Level 5

Yes, I'm trying to get the closet match. Can you tell me how to tweet my code to compare the distances and the get best distance?

Astounding
PROC Star

Not really, no.  That part is up to you, determining which is the closest.  However, I can help.  Once you have decided upon a formula to measure the distance, I can show you how to save just 22,000 observations (each with its closest match) instead of 1.7B observations.

PGStats
Opal | Level 21
  • soundex in groupB should be precalculated in a previous step instead of recalculated for every obs from groupA.Give them lengths identical to the original strings.
  • compged should specify a cutoff argument.
  • In your distance test, you could calculate the second compged (dif2) only if the first one (dif1) is less than 100, and so on for the other distances.
  • Another approach that could save a lot of comparisons would compare only cases where the first letters of all 4 fields match.
PG
Patrick
Opal | Level 21

@ernie86

If you've got the SAS data quality server license (or any license which gives you access to DataFlux and DF... SAS data step functions) then I would first create match codes for your address data and then run all additional logic only for records with the same match codes.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 2247 views
  • 3 likes
  • 5 in conversation