I took a look at your logic and there appears to be a couple of reasons that you are getting the output you presented in your post. 1. In your SCAN function statement, you do not specify any details regarding delimiters. Which means it will use the default list of delimiters. This will cause a problem, since some of the city names contain these delimiter characters. For example, 'Brighton & Hove' contains the ampersand character. So when it scans by X number of words from the back, it will first look for HOVE, and then on the next loop 'BRIGHTON', but never 'BRIGHTON AND HOVE' as needed. By making some slight modifications, we can eliminate this issue. 2.Additionally, I made updates to the values in the city table to replace blank spaces between words with underscores. These spaces were also causing city names to be broken down into smaller chunks that didn't align with the values in the person address field. 3.Different spellings/characters being used for the same city name between the two tables. - For example: On the city table there is, 'St Albans'. But on the company table, that city appears as, 'St. Albansn'. There is an additional 'n' at the end of the city name, a period after 'St' that does not exist for the value on the City table. So I cleaned the table name on the City table for this as well. Stoke-on-Trent also has a stray 'n' at the end of it on the Company table. These would be more database value issues that would need to be handled on that end as bad data entries etc. For the purposes of this code, I updated those two slightly to see how they could go through this code if spelled correctly. 4.In order to show all iterations, specifically when two city names appear in the same 'person address', to appear as two separate observations on the output the 'LEAVE' statement needed to be updated to an 'OUTPUT'. Here is the logic for the updated datasets as well as the updated city name identification Hash logic: DATA company;
LENGTH
company_name $ 24
country $ 2
person_address $ 64 ;
FORMAT
company_name $CHAR24.
country $CHAR2.
person_address $CHAR64. ;
INFORMAT
company_name $CHAR24.
country $CHAR2.
person_address $CHAR64. ;
INFILE DATALINES4
/* DLM='?'*/
DLM='7F'x
MISSOVER
DSD ;
INPUT
company_name : $CHAR24.
country : $CHAR2.
person_address : $CHAR64. ;
DATALINES4;
2K POLYMER SYSTEMSGB21 Warple Way, Aberdeen X
2K POLYMER SYSTEMSGB21 Warple Way, Aberdeen X
2K POLYMER SYSTEMSGB21 Warple Way,Aberdeen, Greater London W3 0RX
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Brighton & Hove BT27 6UB
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Kingston upon Hull BT27 6UB
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place Stoke-on-Trent, SW7 2JN
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place, St. Albans SW7 2JN
;;;;
run;
data city;
infile datalines truncover;
input city $50.;
datalines;
London
Aberdeen
Brighton & Hove
Kingston Upon Hull
Lancaster
Newcastle Upon Tyne
St. Albans
Stoke-on-Trent
;
run;
DATA WORK.CITY (RENAME=(CITY2=CITY));
SET WORK.CITY;
CITY2=TRANWRD(TRANWRD(TRANWRD(UPCASE(CITY),'. ','._'),' & ', '_&_'),' UPON ', '_UPON_');
DROP CITY;
RUN;
DATA SAMPLE(DROP=_:);
SET WORK.COMPANY;
IF _n_=1 THEN DO;
DCL HASH H1();
H1.DEFINEKEY('City');
H1.DEFINEDATA('City');
H1.DEFINEDONE();
DO UNTIL(DONE);
SET WORK.CITY END=DONE;
H1.REF(KEY:UPCASE(City), DATA:City);
END;
END;
CALL MISSING(City);
_Person_Addr_Upcase=TRANWRD(TRANWRD(TRANWRD(UPCASE(PERSON_ADDRESS),'. ','._'),' & ', '_&_'),' UPON ', '_UPON_');
_Word_Cnt=COUNTW(_Person_Addr_Upcase);
DO _i=_Word_Cnt TO 1 BY -1;
IF H1.FIND(KEY:(SCAN(_Person_Addr_Upcase,_i,', ')))=0 THEN OUTPUT;
END;
RUN; I hope this helps.
... View more