Dear all,
I would like to extract the city name from Address variable. for following table 1,
company_nmae | country | person_address |
2K POLYMER SYSTEMS | GB | 21 Warple Way, Aberdeen X |
2K POLYMER SYSTEMS | GB | 21 Warple Way,Aberdeen, Greater London W3 0RX |
2S AIRCHANGERS | GB | BEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Brighton & Hove BT27 6UB |
2S AIRCHANGERS | GB | BEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Kingston upon Hull BT27 6UB |
2S-SOPHISTICATED SYSTEMS | GB | 10 Cromwell Place Stoke-on-Trentn, SW7 2JN |
2S-SOPHISTICATED SYSTEMS | GB | 10 Cromwell Place, St. Albansn SW7 2JN |
I expect to get
company_nmae | country | person_address | City |
2K POLYMER SYSTEMS | GB | 21 Warple Way, Aberdeen X | Aberdeen |
2K POLYMER SYSTEMS | GB | 21 Warple Way,Aberdeen, Greater London W3 0RX | London |
2K POLYMER SYSTEMS | GB | 21 Warple Way,Aberdeen, Greater London W3 0RX | Aberdeen |
2S AIRCHANGERS | GB | BEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Brighton & Hove BT27 6UB | Brighton & Hove |
2S AIRCHANGERS | GB | BEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Kingston upon Hull BT27 6UB | Kingston upon Hull |
2S-SOPHISTICATED SYSTEMS | GB | 10 Cromwell Place Stoke-on-Trentn, SW7 2JN | Stoke-on-Trentn |
2S-SOPHISTICATED SYSTEMS | GB | 10 Cromwell Place, St. Albansn SW7 2JN | St. Albansn |
I have two datasets, the COMPANY dataset includes person_address information, the CITY dataset includes city information. I would like to create a new variable in the COMPANY dataset and extract city name from 'person_address' variable based on 'city' variable (which stores in CITY dataset).
Especially, if a person_address stings both include two city name, I expect to extract both of them. for example for value '21 Warple Way,Aberdeen, Greater London W3 0RX', I expect to extract both London and Aberdeen.
However, by using the code like below,
DATA company;
LENGTH
company_nmae $ 24
country $ 2
person_address $ 64 ;
FORMAT
company_nmae $CHAR24.
country $CHAR2.
person_address $CHAR64. ;
INFORMAT
company_nmae $CHAR24.
country $CHAR2.
person_address $CHAR64. ;
INFILE DATALINES4
DLM='7F'x
MISSOVER
DSD ;
INPUT
company_nmae : $CHAR24.
country : $CHAR2.
person_address : $CHAR64. ;
DATALINES4;
2K POLYMER SYSTEMSGB21 Warple Way, Aberdeen X
2K POLYMER SYSTEMSGB21 Warple Way,Aberdeen, Greater London W3 0RX
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Brighton & Hove BT27 6UB
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Kingston upon Hull BT27 6UB
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place Stoke-on-Trentn, SW7 2JN
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place, St. Albansn SW7 2JN
;;;;
run;
data city;
infile datalines truncover;
input city $50.;
datalines;
London
Aberdeen
Brighton & Hove
Kingston upon Hull
Lancaster
Newcastle upon Tyne
St Albans
Stoke-on-Trent
;
run;
data sample(drop=_:);
set company;
if _n_=1 then
do;
dcl hash h1();
h1.defineKey('city');
h1.defineData('city');
h1.defineDone();
do until(done);
set city end=done;
h1.ref(key:upcase(city), data:city);
end;
end;
call missing(city);
_word_cnt=countw(person_address);
_person_addr_upcase=upcase(person_address);
do _i=_word_cnt to 1 by -1;
if h1.find(key:scan(_person_addr_upcase,_i))=0 then leave;
end;
run;
proc print data=sample;
run;
I can only get
2K POLYMER SYSTEMS | GB | 21 Warple Way, Aberdeen X | Aberdeen |
2K POLYMER SYSTEMS | GB | 21 Warple Way,Aberdeen, Greater London W3 0RX | London |
2S AIRCHANGERS | GB | BEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Brighton & Hove BT27 6UB | |
2S AIRCHANGERS | GB | BEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Kingston upon Hull BT27 6UB | |
2S-SOPHISTICATED SYSTEMS | GB | 10 Cromwell Place Stoke-on-Trentn, SW7 2JN | |
2S-SOPHISTICATED SYSTEMS | GB | 10 Cromwell Place, St. Albansn SW7 2JN |
How can I fix it? Could you please give me some suggestions about this?
thanks in advance
I took a look at your logic and there appears to be a couple of reasons that you are getting the output you presented in your post.
1. In your SCAN function statement, you do not specify any details regarding delimiters. Which means it will use the default list of delimiters. This will cause a problem, since some of the city names contain these delimiter characters. For example, 'Brighton & Hove' contains the ampersand character. So when it scans by X number of words from the back, it will first look for HOVE, and then on the next loop 'BRIGHTON', but never 'BRIGHTON AND HOVE' as needed. By making some slight modifications, we can eliminate this issue.
2.Additionally, I made updates to the values in the city table to replace blank spaces between words with underscores. These spaces were also causing city names to be broken down into smaller chunks that didn't align with the values in the person address field.
3.Different spellings/characters being used for the same city name between the two tables.
- For example: On the city table there is, 'St Albans'. But on the company table, that city appears as, 'St. Albansn'. There is an additional 'n' at the end of the city name, a period after 'St' that does not exist for the value on the City table. So I cleaned the table name on the City table for this as well. Stoke-on-Trent also has a stray 'n' at the end of it on the Company table. These would be more database value issues that would need to be handled on that end as bad data entries etc. For the purposes of this code, I updated those two slightly to see how they could go through this code if spelled correctly.
4.In order to show all iterations, specifically when two city names appear in the same 'person address', to appear as two separate observations on the output the 'LEAVE' statement needed to be updated to an 'OUTPUT'.
Here is the logic for the updated datasets as well as the updated city name identification Hash logic:
DATA company;
LENGTH
company_name $ 24
country $ 2
person_address $ 64 ;
FORMAT
company_name $CHAR24.
country $CHAR2.
person_address $CHAR64. ;
INFORMAT
company_name $CHAR24.
country $CHAR2.
person_address $CHAR64. ;
INFILE DATALINES4
/* DLM='?'*/
DLM='7F'x
MISSOVER
DSD ;
INPUT
company_name : $CHAR24.
country : $CHAR2.
person_address : $CHAR64. ;
DATALINES4;
2K POLYMER SYSTEMSGB21 Warple Way, Aberdeen X
2K POLYMER SYSTEMSGB21 Warple Way, Aberdeen X
2K POLYMER SYSTEMSGB21 Warple Way,Aberdeen, Greater London W3 0RX
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Brighton & Hove BT27 6UB
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Kingston upon Hull BT27 6UB
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place Stoke-on-Trent, SW7 2JN
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place, St. Albans SW7 2JN
;;;;
run;
data city;
infile datalines truncover;
input city $50.;
datalines;
London
Aberdeen
Brighton & Hove
Kingston Upon Hull
Lancaster
Newcastle Upon Tyne
St. Albans
Stoke-on-Trent
;
run;
DATA WORK.CITY (RENAME=(CITY2=CITY));
SET WORK.CITY;
CITY2=TRANWRD(TRANWRD(TRANWRD(UPCASE(CITY),'. ','._'),' & ', '_&_'),' UPON ', '_UPON_');
DROP CITY;
RUN;
DATA SAMPLE(DROP=_:);
SET WORK.COMPANY;
IF _n_=1 THEN DO;
DCL HASH H1();
H1.DEFINEKEY('City');
H1.DEFINEDATA('City');
H1.DEFINEDONE();
DO UNTIL(DONE);
SET WORK.CITY END=DONE;
H1.REF(KEY:UPCASE(City), DATA:City);
END;
END;
CALL MISSING(City);
_Person_Addr_Upcase=TRANWRD(TRANWRD(TRANWRD(UPCASE(PERSON_ADDRESS),'. ','._'),' & ', '_&_'),' UPON ', '_UPON_');
_Word_Cnt=COUNTW(_Person_Addr_Upcase);
DO _i=_Word_Cnt TO 1 BY -1;
IF H1.FIND(KEY:(SCAN(_Person_Addr_Upcase,_i,', ')))=0 THEN OUTPUT;
END;
RUN;
I hope this helps.
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.