I have the following script for a fuzzy match. Dataset groupA has about 22,000 records and groupB has about 77,000 records. The program is still running after 30 minutes, so I wonder whether that is because the datasets have too many records or because something is wrong with my script.
data group_AandB;
  set groupA;
  tmp_carf_name=soundex(Company_Name);
  tmp_carf_address=soundex(Address_1);
  tmp_carf_city=soundex(City);
  tmp_carf_state=soundex(State);
  do i=1 to nobs;
    set groupB(rename=(ADDRESS_1=ADDRESS CITY=CITY1)) point=i nobs=nobs;
    tmp_pdr_name=soundex(ORG_NAME_ACTUAL);
    tmp_pdr_address=soundex(ADDRESS);
    tmp_pdr_city=soundex(CITY1);
    tmp_pdr_state=soundex(STATE_CODE);
    dif1=compged(tmp_carf_name, tmp_pdr_name);
    dif2=compged(tmp_carf_address, tmp_pdr_address);
    dif3=compged(tmp_carf_city, tmp_pdr_city);
    dif4=compged(tmp_carf_state, tmp_pdr_state);
    if dif1<=100 and dif2<=100 and dif3<=100 and dif4<=1 then do;
      drop tmp_carf_name tmp_pdr_name tmp_carf_address tmp_pdr_address
           tmp_carf_city tmp_pdr_city tmp_carf_state tmp_pdr_state
           dif1 dif2 dif3;
      output;
    end;
  end;
run;