Dear all,
I'd like to select most likely UK city for each company in table 1.
logic:
1 group by each 'company_name'.
2 split text in 'persn_address' by black, comma and semicolon
3 select the most frequent character which matched the city name in table 2.
4 keep them all if they appear equally often.
by using a sample in Table 1.
company_nmae | country | person_address |
2K POLYMER SYSTEMS | GB | |
2K POLYMER SYSTEMS | GB | 21 Warple Way,London, Greater London W3 0RX |
2K POLYMER SYSTEMS | GB | London |
2MEE | GB | |
2MEE | GB | York |
2S AIRCHANGERS | GB | |
2S AIRCHANGERS | GB | BEECH HOUSE, 4 CARRICKNAVEAGH ROAD, LISBURN, ANTRIM BT27 6UB |
2S AIRCHANGERS | GB | BEECH HOUSE;4 CARRICKNAVEAGH ROAD;LISBURN ANTRIM BT27 6UB |
2S AIRCHANGERS | GB | BROOK HOUSE,DUNMURRY INDUSTRIAL ESTATE,DUNMURRY,BELFAST,BT17 9HU |
2S AIRCHANGERS | GB | Beech House, 4 Carricknaveagh Road,Lisburn, Antrim BT27 6UB |
2S-SOFISTIKEJTED SISTEMS | GB | |
2S-SOPHISTICATED SYSTEMS | GB | |
2S-SOPHISTICATED SYSTEMS | GB | 10 Cromwell Place South Kensington,London SW7 2JN |
2S-SOPHISTICATED SYSTEMS | GB | 10 Cromwell Place, South Kensington,London SW7 2JN |
2S-SOPHISTICATED SYSTEMS | GB | London |
2VU | GB | |
3 SPACE SOFTWARE | GB | |
3 SPACE SOFTWARE | GB | 3 Soho Street,London W1V 5FA |
3-D COMPOSITS | GB | Cumbria |
32 DEGREES | GB | |
32 DEGREES | GB | Lynton House,7-12 Tavistock Square,London WC1H 9BQ |
32 DEGREES | GB | Nottingham |
table 2
City |
Aberdeen |
Armagh |
Bangor |
Bath |
Belfast |
Birmingham |
Bradford |
Brighton & Hove |
Bristol |
Cambridge |
Canterbury |
Cardiff |
Carlisle |
Chelmsford |
Chester |
Chichester |
Coventry |
Derby |
Derry |
Dundee |
Durham |
Edinburgh |
Ely |
Exeter |
Glasgow |
Gloucester |
Hereford |
Inverness |
Kingston upon Hull |
Lancaster |
Leeds |
Leicester |
Lichfield |
Lincoln |
Lisburn |
Liverpool |
London |
Manchester |
Newcastle upon Tyne |
Newport |
Newry |
Norwich |
Nottingham |
Oxford |
Perth |
Peterborough |
Plymouth |
Portsmouth |
Preston |
Ripon |
St Albans |
St Asaph |
St Davids |
Salford |
Salisbury |
Sheffield |
Southampton |
Stirling |
Stoke-on-Trent |
Sunderland |
Swansea |
Truro |
Wakefield |
Wells |
Westminster |
Winchester |
Wolverhampton |
Worcester |
York |
the first sample.
1 group the rows based on 'company_name' --'2K POLYMER SYSTEMS'
company_nmae | country | person_address |
2K POLYMER SYSTEMS | GB | |
2K POLYMER SYSTEMS | GB | 21 Warple Way,London, Greater London W3 0RX |
2K POLYMER SYSTEMS | GB | London |
2 spilt the text in 'person_address' based on the blank, comma and semicolon.
company_nmae | country | person_address |
2K POLYMER SYSTEMS | GB | |
2K POLYMER SYSTEMS | GB | 21 |
Warple | ||
Way | ||
London | ||
Greater | ||
London | ||
W3 | ||
0RX | ||
2K POLYMER SYSTEMS | GB | London |
3 select the most frequent character which matched the city name in table 2.
the most frequent character is 'London'. it also matches the city name in table 2 (i.e., London)
So the city of company '2K POLYMER SYSTEMS' is 'London'.
the second example.
32 DEGREES | GB | |
32 DEGREES | GB | Lynton House,7-12 Tavistock Square,London WC1H 9BQ |
32 DEGREES | GB | Nottingham |
the result should be
32 DEGREES | GB | London |
32 DEGREES | GB | Nottingham |
could you please give me some suggestion about this?
thanks in advance.
Below some code which lets you lookup the city per row which I guess is the bit you're struggling with.
It would help in the future if you could post sample data in the form of a data step so we don't have to spend time doing this.
DATA company;
LENGTH
company_nmae $ 24
country $ 2
person_address $ 64 ;
FORMAT
company_nmae $CHAR24.
country $CHAR2.
person_address $CHAR64. ;
INFORMAT
company_nmae $CHAR24.
country $CHAR2.
person_address $CHAR64. ;
INFILE DATALINES4
DLM='7F'x
MISSOVER
DSD ;
INPUT
company_nmae : $CHAR24.
country : $CHAR2.
person_address : $CHAR64. ;
DATALINES4;
2K POLYMER SYSTEMSGB
2K POLYMER SYSTEMSGB21 Warple Way,London, Greater London W3 0RX
2K POLYMER SYSTEMSGBLondon
2MEEGB
2MEEGBYork
2S AIRCHANGERSGB
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, LISBURN, ANTRIM BT27 6UB
2S AIRCHANGERSGBBEECH HOUSE;4 CARRICKNAVEAGH ROAD;LISBURN ANTRIM BT27 6UB
2S AIRCHANGERSGBBROOK HOUSE,DUNMURRY INDUSTRIAL ESTATE,DUNMURRY,BELFAST,BT17 9HU
2S AIRCHANGERSGBBeech House, 4 Carricknaveagh Road,Lisburn, Antrim BT27 6UB
2S-SOFISTIKEJTED SISTEMSGB
2S-SOPHISTICATED SYSTEMSGB
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place South Kensington,London SW7 2JN
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place, South Kensington,London SW7 2JN
2S-SOPHISTICATED SYSTEMSGBLondon
2VUGB
3 SPACE SOFTWAREGB
3 SPACE SOFTWAREGB3 Soho Street,London W1V 5FA
3-D COMPOSITSGBCumbria
32 DEGREESGB
32 DEGREESGBLynton House,7-12 Tavistock Square,London WC1H 9BQ
32 DEGREESGBNottingham
;;;;
run;
data city;
infile datalines truncover;
input city $50.;
datalines;
Aberdeen
Armagh
Bangor
Bath
Belfast
Birmingham
Bradford
Brighton & Hove
Bristol
Cambridge
Canterbury
Cardiff
Carlisle
Chelmsford
Chester
Chichester
Coventry
Derby
Derry
Dundee
Durham
Edinburgh
Ely
Exeter
Glasgow
Gloucester
Hereford
Inverness
Kingston upon Hull
Lancaster
Leeds
Leicester
Lichfield
Lincoln
Lisburn
Liverpool
London
Manchester
Newcastle upon Tyne
Newport
Newry
Norwich
Nottingham
Oxford
Perth
Peterborough
Plymouth
Portsmouth
Preston
Ripon
St Albans
St Asaph
St Davids
Salford
Salisbury
Sheffield
Southampton
Stirling
Stoke-on-Trent
Sunderland
Swansea
Truro
Wakefield
Wells
Westminster
Winchester
Wolverhampton
Worcester
York
;
run;
data sample(drop=_:);
set company;
if _n_=1 then
do;
dcl hash h1();
h1.defineKey('city');
h1.defineData('city');
h1.defineDone();
do until(done);
set city end=done;
h1.ref(key:upcase(city), data:city);
end;
end;
call missing(city);
_word_cnt=countw(person_address);
_person_addr_upcase=upcase(person_address);
do _i=_word_cnt to 1 by -1;
if h1.find(key:scan(_person_addr_upcase,_i))=0 then leave;
end;
run;
Below some code which lets you lookup the city per row which I guess is the bit you're struggling with.
It would help in the future if you could post sample data in the form of a data step so we don't have to spend time doing this.
DATA company;
LENGTH
company_nmae $ 24
country $ 2
person_address $ 64 ;
FORMAT
company_nmae $CHAR24.
country $CHAR2.
person_address $CHAR64. ;
INFORMAT
company_nmae $CHAR24.
country $CHAR2.
person_address $CHAR64. ;
INFILE DATALINES4
DLM='7F'x
MISSOVER
DSD ;
INPUT
company_nmae : $CHAR24.
country : $CHAR2.
person_address : $CHAR64. ;
DATALINES4;
2K POLYMER SYSTEMSGB
2K POLYMER SYSTEMSGB21 Warple Way,London, Greater London W3 0RX
2K POLYMER SYSTEMSGBLondon
2MEEGB
2MEEGBYork
2S AIRCHANGERSGB
2S AIRCHANGERSGBBEECH HOUSE, 4 CARRICKNAVEAGH ROAD, LISBURN, ANTRIM BT27 6UB
2S AIRCHANGERSGBBEECH HOUSE;4 CARRICKNAVEAGH ROAD;LISBURN ANTRIM BT27 6UB
2S AIRCHANGERSGBBROOK HOUSE,DUNMURRY INDUSTRIAL ESTATE,DUNMURRY,BELFAST,BT17 9HU
2S AIRCHANGERSGBBeech House, 4 Carricknaveagh Road,Lisburn, Antrim BT27 6UB
2S-SOFISTIKEJTED SISTEMSGB
2S-SOPHISTICATED SYSTEMSGB
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place South Kensington,London SW7 2JN
2S-SOPHISTICATED SYSTEMSGB10 Cromwell Place, South Kensington,London SW7 2JN
2S-SOPHISTICATED SYSTEMSGBLondon
2VUGB
3 SPACE SOFTWAREGB
3 SPACE SOFTWAREGB3 Soho Street,London W1V 5FA
3-D COMPOSITSGBCumbria
32 DEGREESGB
32 DEGREESGBLynton House,7-12 Tavistock Square,London WC1H 9BQ
32 DEGREESGBNottingham
;;;;
run;
data city;
infile datalines truncover;
input city $50.;
datalines;
Aberdeen
Armagh
Bangor
Bath
Belfast
Birmingham
Bradford
Brighton & Hove
Bristol
Cambridge
Canterbury
Cardiff
Carlisle
Chelmsford
Chester
Chichester
Coventry
Derby
Derry
Dundee
Durham
Edinburgh
Ely
Exeter
Glasgow
Gloucester
Hereford
Inverness
Kingston upon Hull
Lancaster
Leeds
Leicester
Lichfield
Lincoln
Lisburn
Liverpool
London
Manchester
Newcastle upon Tyne
Newport
Newry
Norwich
Nottingham
Oxford
Perth
Peterborough
Plymouth
Portsmouth
Preston
Ripon
St Albans
St Asaph
St Davids
Salford
Salisbury
Sheffield
Southampton
Stirling
Stoke-on-Trent
Sunderland
Swansea
Truro
Wakefield
Wells
Westminster
Winchester
Wolverhampton
Worcester
York
;
run;
data sample(drop=_:);
set company;
if _n_=1 then
do;
dcl hash h1();
h1.defineKey('city');
h1.defineData('city');
h1.defineDone();
do until(done);
set city end=done;
h1.ref(key:upcase(city), data:city);
end;
end;
call missing(city);
_word_cnt=countw(person_address);
_person_addr_upcase=upcase(person_address);
do _i=_word_cnt to 1 by -1;
if h1.find(key:scan(_person_addr_upcase,_i))=0 then leave;
end;
run;
Dear Patrick,
thank you for your help.
I will post sample data in the form of a data step in the future.
besides, I have one more question. How can I select the city if the name of a city is constructed by two or three characters? for example 'Greater Manchester'
thanks in advance.
This does require you to separate out the city component and then it does two things.
1. If the city is in the list as is, that value is assumed to be correct.
2. If the city is not in the list, then you calculate the distance between the city names and take the lowest value.
*master city lookup data;
data city;
infile datalines truncover;
input city $50.;
datalines;
Aberdeen
Armagh
Bangor
Bath
Belfast
Birmingham
Bradford
Brighton & Hove
Bristol
Cambridge
Canterbury
Cardiff
Carlisle
Chelmsford
Chester
Chichester
Coventry
Derby
Derry
;
;
;
run;
*data with spelling mistakes;
data have;
infile datalines truncover;
input city $50.;
datalines;
Aberden
Armagh
Bangor
Bath
Belfast234
Birmingham
Bradford
Brighton & Hove
Bristol
Cambridge
Canterbury Street
Cardiff
Carlisle
Chelmsford
Chester
Chichester
Coventry Road
Derby
Derry
;
;
;
run;
*find the nearest city;
data nearest_city;
array m_city (19) $50 _temporary_ ;
*load into array;
if _n_=1 then
do i=1 to 19;
set City;
m_city(i)=city;
end;
set have;
dist=9999999;
index=0;
if whichc(city, of m_city(*)) then
correct_city=city;
else
do i=1 to 19;
_dist=compged(city, m_city(i));
if _dist < dist then
do;
index=i;
dist=_dist;
end;
end;
if missing(correct_city) then
correct_city=m_city(index);
*drop _: index i dist;
run;
Dear Reeza,
Just remind. thanks in advance.
another way you can do is
proc sql;
create table want as
select a.*, b.* from company a
left join city b
on index(upcase(trim(person_address)),upcase(trim(city))) gt 0;
You seem to have postal codes for almost all of the companies, which appear fairly easy to find in the address field. Isn't there a publicly available file linking postal codes to city names? I presume that would be a very accurate technique for those companies.
Then you would have a small subset over which you face other parsing issues and the attendant uncertainties.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.