BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
France
Quartz | Level 8

Dear all,

 

I'd like to select most likely UK city for each company in table 1.

 

logic:

1 group by each 'company_name'.

2 split text in 'persn_address' by black, comma and semicolon

3 select the most frequent character which matched the city name in table 2.

4 keep them all if they appear equally often. 

 

by using a sample in Table 1.

company_nmaecountryperson_address
2K POLYMER SYSTEMSGB 
2K POLYMER SYSTEMSGB21 Warple Way,London, Greater London W3 0RX
2K POLYMER SYSTEMSGBLondon
2MEEGB 
2MEEGBYork
2S AIRCHANGERSGB 
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, LISBURN, ANTRIM BT27 6UB
2S AIRCHANGERSGBBEECH HOUSE;4 CARRICKNAVEAGH ROAD;LISBURN ANTRIM BT27 6UB
2S AIRCHANGERSGBBROOK HOUSE,DUNMURRY INDUSTRIAL ESTATE,DUNMURRY,BELFAST,BT17 9HU
2S AIRCHANGERSGBBeech House, 4 Carricknaveagh Road,Lisburn, Antrim BT27 6UB
2S-SOFISTIKEJTED SISTEMSGB 
2S-SOPHISTICATED SYSTEMSGB 
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place South Kensington,London SW7 2JN
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place, South Kensington,London SW7 2JN
2S-SOPHISTICATED SYSTEMSGBLondon
2VUGB 
3 SPACE SOFTWAREGB 
3 SPACE SOFTWAREGB3 Soho Street,London W1V 5FA
3-D COMPOSITSGBCumbria
32 DEGREESGB 
32 DEGREESGBLynton House,7-12 Tavistock Square,London WC1H 9BQ
32 DEGREESGBNottingham
   

 

table 2 

City
Aberdeen
Armagh
Bangor
Bath
Belfast
Birmingham
Bradford
Brighton & Hove
Bristol
Cambridge
Canterbury
Cardiff
Carlisle
Chelmsford
Chester
Chichester
Coventry
Derby
Derry 
Dundee
Durham
Edinburgh
Ely
Exeter
Glasgow
Gloucester
Hereford
Inverness
Kingston upon Hull
Lancaster
Leeds
Leicester
Lichfield
Lincoln
Lisburn
Liverpool
London
Manchester
Newcastle upon Tyne
Newport
Newry
Norwich
Nottingham
Oxford
Perth
Peterborough
Plymouth
Portsmouth
Preston
Ripon
St Albans
St Asaph
St Davids
Salford
Salisbury
Sheffield
Southampton
Stirling
Stoke-on-Trent
Sunderland
Swansea
Truro
Wakefield
Wells
Westminster
Winchester
Wolverhampton
Worcester
York

 

 the first sample.

1 group the rows based on 'company_name' --'2K POLYMER SYSTEMS'

 

company_nmaecountryperson_address
2K POLYMER SYSTEMSGB 
2K POLYMER SYSTEMSGB21 Warple Way,London, Greater London W3 0RX
2K POLYMER SYSTEMSGBLondon

 

2 spilt the text in 'person_address' based on the blank, comma and semicolon.

company_nmaecountryperson_address
2K POLYMER SYSTEMSGB 
2K POLYMER SYSTEMSGB21
  Warple
  Way
  London
  Greater
  London
  W3
  0RX
2K POLYMER SYSTEMSGBLondon

select the most frequent character which matched the city name in table 2.

the most frequent character is 'London'. it also matches the city name in table 2 (i.e., London)

 

So the city of company '2K POLYMER SYSTEMS' is 'London'.

 

the second example.

32 DEGREESGB 
32 DEGREESGBLynton House,7-12 Tavistock Square,London WC1H 9BQ
32 DEGREESGBNottingham

the result should be 

32 DEGREESGBLondon
32 DEGREESGBNottingham

 

could you please give me some suggestion about this? 

thanks in advance.

 

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

Below some code which lets you lookup the city per row which I guess is the bit you're struggling with.

It would help in the future if you could post sample data in the form of a data step so we don't have to spend time doing this.

DATA company;
    LENGTH
        company_nmae     $ 24
        country          $ 2
        person_address   $ 64 ;
    FORMAT
        company_nmae     $CHAR24.
        country          $CHAR2.
        person_address   $CHAR64. ;
    INFORMAT
        company_nmae     $CHAR24.
        country          $CHAR2.
        person_address   $CHAR64. ;
    INFILE DATALINES4
        DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        company_nmae     : $CHAR24.
        country          : $CHAR2.
        person_address   : $CHAR64. ;
DATALINES4;
2K POLYMER SYSTEMSGB 
2K POLYMER SYSTEMSGB21 Warple Way,London, Greater London W3 0RX
2K POLYMER SYSTEMSGBLondon
2MEEGB 
2MEEGBYork
2S AIRCHANGERSGB 
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, LISBURN, ANTRIM BT27 6UB
2S AIRCHANGERSGBBEECH HOUSE;4 CARRICKNAVEAGH ROAD;LISBURN ANTRIM BT27 6UB
2S AIRCHANGERSGBBROOK HOUSE,DUNMURRY INDUSTRIAL ESTATE,DUNMURRY,BELFAST,BT17 9HU
2S AIRCHANGERSGBBeech House, 4 Carricknaveagh Road,Lisburn, Antrim BT27 6UB
2S-SOFISTIKEJTED SISTEMSGB 
2S-SOPHISTICATED SYSTEMSGB 
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place South Kensington,London SW7 2JN
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place, South Kensington,London SW7 2JN
2S-SOPHISTICATED SYSTEMSGBLondon
2VUGB 
3 SPACE SOFTWAREGB 
3 SPACE SOFTWAREGB3 Soho Street,London W1V 5FA
3-D COMPOSITSGBCumbria
32 DEGREESGB 
32 DEGREESGBLynton House,7-12 Tavistock Square,London WC1H 9BQ
32 DEGREESGBNottingham
;;;;
run;

data city;
  infile datalines truncover;
  input city $50.;
  datalines;
Aberdeen
Armagh
Bangor
Bath
Belfast
Birmingham
Bradford
Brighton & Hove
Bristol
Cambridge
Canterbury
Cardiff
Carlisle
Chelmsford
Chester
Chichester
Coventry
Derby
Derry 
Dundee
Durham
Edinburgh
Ely
Exeter
Glasgow
Gloucester
Hereford
Inverness
Kingston upon Hull
Lancaster
Leeds
Leicester
Lichfield
Lincoln
Lisburn
Liverpool
London
Manchester
Newcastle upon Tyne
Newport
Newry
Norwich
Nottingham
Oxford
Perth
Peterborough
Plymouth
Portsmouth
Preston
Ripon
St Albans
St Asaph
St Davids
Salford
Salisbury
Sheffield
Southampton
Stirling
Stoke-on-Trent
Sunderland
Swansea
Truro
Wakefield
Wells
Westminster
Winchester
Wolverhampton
Worcester
York
;
run;

data sample(drop=_:);
  set company;
  if _n_=1 then 
    do;
      dcl hash h1();
      h1.defineKey('city');
      h1.defineData('city');
      h1.defineDone();
      do until(done);
        set city end=done;
        h1.ref(key:upcase(city), data:city);
      end;
    end;
  
  call missing(city);
  _word_cnt=countw(person_address);
  _person_addr_upcase=upcase(person_address);
  do _i=_word_cnt to 1 by -1;
    if h1.find(key:scan(_person_addr_upcase,_i))=0 then leave;
  end;
run;

View solution in original post

8 REPLIES 8
Patrick
Opal | Level 21

Below some code which lets you lookup the city per row which I guess is the bit you're struggling with.

It would help in the future if you could post sample data in the form of a data step so we don't have to spend time doing this.

DATA company;
    LENGTH
        company_nmae     $ 24
        country          $ 2
        person_address   $ 64 ;
    FORMAT
        company_nmae     $CHAR24.
        country          $CHAR2.
        person_address   $CHAR64. ;
    INFORMAT
        company_nmae     $CHAR24.
        country          $CHAR2.
        person_address   $CHAR64. ;
    INFILE DATALINES4
        DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        company_nmae     : $CHAR24.
        country          : $CHAR2.
        person_address   : $CHAR64. ;
DATALINES4;
2K POLYMER SYSTEMSGB 
2K POLYMER SYSTEMSGB21 Warple Way,London, Greater London W3 0RX
2K POLYMER SYSTEMSGBLondon
2MEEGB 
2MEEGBYork
2S AIRCHANGERSGB 
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, LISBURN, ANTRIM BT27 6UB
2S AIRCHANGERSGBBEECH HOUSE;4 CARRICKNAVEAGH ROAD;LISBURN ANTRIM BT27 6UB
2S AIRCHANGERSGBBROOK HOUSE,DUNMURRY INDUSTRIAL ESTATE,DUNMURRY,BELFAST,BT17 9HU
2S AIRCHANGERSGBBeech House, 4 Carricknaveagh Road,Lisburn, Antrim BT27 6UB
2S-SOFISTIKEJTED SISTEMSGB 
2S-SOPHISTICATED SYSTEMSGB 
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place South Kensington,London SW7 2JN
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place, South Kensington,London SW7 2JN
2S-SOPHISTICATED SYSTEMSGBLondon
2VUGB 
3 SPACE SOFTWAREGB 
3 SPACE SOFTWAREGB3 Soho Street,London W1V 5FA
3-D COMPOSITSGBCumbria
32 DEGREESGB 
32 DEGREESGBLynton House,7-12 Tavistock Square,London WC1H 9BQ
32 DEGREESGBNottingham
;;;;
run;

data city;
  infile datalines truncover;
  input city $50.;
  datalines;
Aberdeen
Armagh
Bangor
Bath
Belfast
Birmingham
Bradford
Brighton & Hove
Bristol
Cambridge
Canterbury
Cardiff
Carlisle
Chelmsford
Chester
Chichester
Coventry
Derby
Derry 
Dundee
Durham
Edinburgh
Ely
Exeter
Glasgow
Gloucester
Hereford
Inverness
Kingston upon Hull
Lancaster
Leeds
Leicester
Lichfield
Lincoln
Lisburn
Liverpool
London
Manchester
Newcastle upon Tyne
Newport
Newry
Norwich
Nottingham
Oxford
Perth
Peterborough
Plymouth
Portsmouth
Preston
Ripon
St Albans
St Asaph
St Davids
Salford
Salisbury
Sheffield
Southampton
Stirling
Stoke-on-Trent
Sunderland
Swansea
Truro
Wakefield
Wells
Westminster
Winchester
Wolverhampton
Worcester
York
;
run;

data sample(drop=_:);
  set company;
  if _n_=1 then 
    do;
      dcl hash h1();
      h1.defineKey('city');
      h1.defineData('city');
      h1.defineDone();
      do until(done);
        set city end=done;
        h1.ref(key:upcase(city), data:city);
      end;
    end;
  
  call missing(city);
  _word_cnt=countw(person_address);
  _person_addr_upcase=upcase(person_address);
  do _i=_word_cnt to 1 by -1;
    if h1.find(key:scan(_person_addr_upcase,_i))=0 then leave;
  end;
run;
France
Quartz | Level 8

Dear Patrick,

 

thank you for your help.

 

I will post sample data in the form of a data step in the future.

 

besides, I have one more question. How can I select the city if the name of a city is constructed by two or three characters? for example 'Greater Manchester'

 

thanks in advance.

Patrick
Opal | Level 21

@France

I didn't answer your "greater manchester" question yet because I was waiting for @Reeza's answer. Now that you've accepted my solution: Have you found an answer yourself or do you still require a solution for this?

Reeza
Super User

This does require you to separate out the city component and then it does two things. 

 

1. If the city is in the list as is, that value is assumed to be correct. 

2. If the city is not in the list, then you calculate the distance between the city names and take the lowest value. 

 

*master city lookup data;
data city;
 infile datalines truncover;
 input city $50.;
 datalines;
Aberdeen
Armagh
Bangor
Bath
Belfast
Birmingham
Bradford
Brighton & Hove
Bristol
Cambridge
Canterbury
Cardiff
Carlisle
Chelmsford
Chester
Chichester
Coventry
Derby
Derry 
;
 ;
 ;
run;

*data with spelling mistakes;
data have;
 infile datalines truncover;
 input city $50.;
 datalines;
Aberden
Armagh
Bangor
Bath
Belfast234
Birmingham
Bradford
Brighton & Hove
Bristol
Cambridge
Canterbury Street
Cardiff
Carlisle
Chelmsford
Chester
Chichester
Coventry Road
Derby
Derry 
;
 ;
 ;
run;


*find the nearest city;
data nearest_city;
 array m_city (19) $50 _temporary_ ;
 *load into array;

 if _n_=1 then
    do i=1 to 19;
        set City;
        m_city(i)=city;
    end;

set have;
dist=9999999;
index=0;

if whichc(city, of m_city(*)) then
 correct_city=city;
else
 do i=1 to 19;
    _dist=compged(city, m_city(i));

    if _dist < dist then
        do;
            index=i;
            dist=_dist;
        end;
 end;

if missing(correct_city) then
 correct_city=m_city(index);
 
 *drop _: index i dist;
run;
Reeza
Super User
I have some code that looks for an exact match, if it doesn't find it, it then looks for the nearest match via distance based on names. However you do need to pull out the city first. Remind me tomorrow and I can pull it up for you, it's part of a work script.
France
Quartz | Level 8

Dear Reeza,

 

Just remind. thanks in advance. 

kiranv_
Rhodochrosite | Level 12

another way you can do is

 

proc sql;
create table want as
select a.*, b.* from company a
left join city b
on index(upcase(trim(person_address)),upcase(trim(city))) gt 0;
mkeintz
PROC Star

You seem to have postal codes for almost all of the companies, which appear fairly easy to find in the address field.  Isn't there a publicly available file linking postal codes to city names?  I presume that would be a very accurate technique for those companies.

 

Then you would have a small subset over  which you face other parsing issues and the attendant uncertainties.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 1226 views
  • 6 likes
  • 5 in conversation