BookmarkSubscribeRSS Feed
Alexxxxxxx
Pyrite | Level 9

Dear all,

 

I would like to extract the city name from Address variable. for following table 1,

company_nmaecountry person_address 
2K POLYMER SYSTEMS GB 21 Warple Way, Aberdeen X
2K POLYMER SYSTEMSGB21 Warple Way,Aberdeen, Greater London W3 0RX
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Brighton & Hove BT27 6UB
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Kingston upon Hull BT27 6UB
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place Stoke-on-Trentn, SW7 2JN
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place, St. Albansn SW7 2JN

I expect to get 

 

company_nmaecountry person_address City
2K POLYMER SYSTEMS GB 21 Warple Way, Aberdeen XAberdeen
2K POLYMER SYSTEMSGB21 Warple Way,Aberdeen, Greater London W3 0RXLondon
2K POLYMER SYSTEMSGB21 Warple Way,Aberdeen, Greater London W3 0RXAberdeen
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Brighton & Hove BT27 6UBBrighton & Hove
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Kingston upon Hull BT27 6UBKingston upon Hull
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place Stoke-on-Trentn, SW7 2JNStoke-on-Trentn
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place, St. Albansn SW7 2JNSt. Albansn

 

I have two datasets, the COMPANY dataset includes person_address information, the CITY dataset includes city information. I would like to create a new variable in the COMPANY dataset and extract city name from 'person_address' variable based on 'city' variable (which stores in CITY dataset).

 

Especially, if a person_address stings both include two city name, I expect to extract both of them. for example for value '21 Warple Way,Aberdeen, Greater London W3 0RX', I expect to extract both London and Aberdeen.

 

 

 

However, by using the code like below,

DATA company;
    LENGTH
        company_nmae     $ 24
        country          $ 2
        person_address   $ 64 ;
    FORMAT
        company_nmae     $CHAR24.
        country          $CHAR2.
        person_address   $CHAR64. ;
    INFORMAT
        company_nmae     $CHAR24.
        country          $CHAR2.
        person_address   $CHAR64. ;
    INFILE DATALINES4
        DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        company_nmae     : $CHAR24.
        country          : $CHAR2.
        person_address   : $CHAR64. ;
DATALINES4;
2K POLYMER SYSTEMSGB21 Warple Way, Aberdeen X 
2K POLYMER SYSTEMSGB21 Warple Way,Aberdeen, Greater London W3 0RX
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Brighton & Hove BT27 6UB
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Kingston upon Hull BT27 6UB
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place Stoke-on-Trentn, SW7 2JN
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place, St. Albansn SW7 2JN
;;;;
run;

data city;
  infile datalines truncover;
  input city $50.;
  datalines;
London
Aberdeen
Brighton & Hove
Kingston upon Hull
Lancaster
Newcastle upon Tyne
St Albans
Stoke-on-Trent
;
run;

data sample(drop=_:);
  set company;
  if _n_=1 then 
    do;
      dcl hash h1();
      h1.defineKey('city');
      h1.defineData('city');
      h1.defineDone();
      do until(done);
        set city end=done;
        h1.ref(key:upcase(city), data:city);
      end;
    end;
  
  call missing(city);
  _word_cnt=countw(person_address);
  _person_addr_upcase=upcase(person_address);
  do _i=_word_cnt to 1 by -1;
    if h1.find(key:scan(_person_addr_upcase,_i))=0 then leave;
  end;
run;

proc print data=sample;
run;

I can only get 

Obs company_nmae country person_address city123456
2K POLYMER SYSTEMSGB21 Warple Way, Aberdeen XAberdeen
2K POLYMER SYSTEMSGB21 Warple Way,Aberdeen, Greater London W3 0RXLondon
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Brighton & Hove BT27 6UB 
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Kingston upon Hull BT27 6UB 
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place Stoke-on-Trentn, SW7 2JN 
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place, St. Albansn SW7 2JN 

 

How can I fix it? Could you please give me some suggestions about this?

thanks in advance

 

1 REPLY 1
tsap
Pyrite | Level 9

I took a look at your logic and there appears to be a couple of reasons that you are getting the output you presented in your post.

1. In your SCAN function statement, you do not specify any details regarding delimiters. Which means it will use the default list of delimiters. This will cause a problem, since some of the city names contain these delimiter characters. For example, 'Brighton & Hove' contains the ampersand character. So when it scans by X number of words from the back, it will first look for HOVE, and then on the next loop 'BRIGHTON', but never 'BRIGHTON AND HOVE' as needed. By making some slight modifications, we can eliminate this issue.

 

2.Additionally, I made updates to the values in the city table to replace blank spaces between words with underscores. These spaces were also causing city names to be broken down into smaller chunks that didn't align with the values in the person address field.

 

3.Different spellings/characters being used for the same city name between the two tables.

     - For example: On the city table there is, 'St Albans'. But on the company table, that city appears as, 'St. Albansn'. There is an additional 'n' at the end of the city name, a period after 'St' that does not exist for the value on the City table. So I cleaned the table name on the City table for this as well. Stoke-on-Trent also has a stray 'n' at the end of it on the Company table. These would be more database value issues that would need to be handled on that end as bad data entries etc. For the purposes of this code, I updated those two slightly to see how they could go through this code if spelled correctly.

 

4.In order to show all iterations, specifically when two city names appear in the same 'person address', to appear as two separate observations on the output the 'LEAVE' statement needed to be updated to an 'OUTPUT'.

 

Here is the logic for the updated datasets as well as the updated city name identification Hash logic:

DATA company;
    LENGTH
        company_name     $ 24
        country          $ 2
        person_address   $ 64 ;
    FORMAT
        company_name     $CHAR24.
        country          $CHAR2.
        person_address   $CHAR64. ;
    INFORMAT
        company_name     $CHAR24.
        country          $CHAR2.
        person_address   $CHAR64. ;
    INFILE DATALINES4
/*        DLM='?'*/
		DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        company_name     : $CHAR24.
        country          : $CHAR2.
        person_address   : $CHAR64. ;
DATALINES4;
2K POLYMER SYSTEMSGB21 Warple Way, Aberdeen X 
2K POLYMER SYSTEMSGB21 Warple Way, Aberdeen X
2K POLYMER SYSTEMSGB21 Warple Way,Aberdeen, Greater London W3 0RX
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Brighton & Hove BT27 6UB
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, Kingston upon Hull BT27 6UB
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place Stoke-on-Trent, SW7 2JN
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place, St. Albans SW7 2JN
;;;;
run;

data city;
  infile datalines truncover;
  input city $50.;
  datalines;
London
Aberdeen
Brighton & Hove
Kingston Upon Hull
Lancaster
Newcastle Upon Tyne
St. Albans
Stoke-on-Trent
;
run;

DATA WORK.CITY (RENAME=(CITY2=CITY));
	SET WORK.CITY;
	CITY2=TRANWRD(TRANWRD(TRANWRD(UPCASE(CITY),'. ','._'),' & ', '_&_'),' UPON ', '_UPON_');
	DROP CITY;
RUN;

DATA SAMPLE(DROP=_:);
  SET WORK.COMPANY;
  IF _n_=1 THEN DO;
      DCL HASH H1();
      H1.DEFINEKEY('City');
      H1.DEFINEDATA('City');
      H1.DEFINEDONE();
      DO UNTIL(DONE);
        SET WORK.CITY END=DONE;
        H1.REF(KEY:UPCASE(City), DATA:City);
      END;
    END;

	CALL MISSING(City);
	_Person_Addr_Upcase=TRANWRD(TRANWRD(TRANWRD(UPCASE(PERSON_ADDRESS),'. ','._'),' & ', '_&_'),' UPON ', '_UPON_');
	_Word_Cnt=COUNTW(_Person_Addr_Upcase);
	DO _i=_Word_Cnt TO 1 BY -1;
		IF H1.FIND(KEY:(SCAN(_Person_Addr_Upcase,_i,', ')))=0 THEN OUTPUT;
	END;
RUN;

 

I hope this helps.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 1 reply
  • 1257 views
  • 0 likes
  • 2 in conversation