Re: Search address string to assign country code

Sean_OConnor · Posted 05-23-2019 09:46 AM

Folks,

I have a number of address that I want to search for a country name.

At present I'm using the following code.

data countries;
set test;
if index(address,"LONDON")>0 THEN countyname_1='UK';
if index(address,"MANCHESTER")>0 THEN countyname_1='UK';
if index(address,"LEEDS")>0 THEN countyname_1='UK';
if index(address,"SCOTLAND")>0 THEN countyname_1='UK';
if index(address,"WALES")>0 THEN countyname_1='UK';
if index(address,"LIVERPOOL")>0 THEN countyname_1='UK';
if index(address,"SCOTLAND")>0 THEN countyname_1='UK';
if index(address,"ENGLAND")>0 THEN countyname_1='UK';
if index(address,"GLASGOW")>0 THEN countyname_1='UK';
if index(address,"Edinburgh")>0 THEN countyname_1='UK';
if index(address,"BRISTOL")>0 THEN countyname_1='UK';
if index(address,"CHESHIRE")>0 THEN countyname_1='UK';
if index(address,"BIRMINGHAM")>0 THEN countyname_1='UK';
if index(address,"ESSEX")>0 THEN countyname_1='UK';
if index(address,"BIRMINGHAM")>0 THEN countyname_1='UK';
if index(address,"SHIRE")>0 THEN countyname_1='UK';
if index(address,"SURREY")>0 THEN countyname_1='UK';
if index(address,"SUNDERLAND")>0 THEN countyname_1='UK';
if index(address,"DEVON")>0 THEN countyname_1='UK';
if index(address,"NEW YORK")>0 THEN countyname_1='USA and Canada';
if index(address,"NY")>0 THEN countyname_1='USA and Canada';
if index(address,"USA")>0 THEN countyname_1='USA and Canada';
if index(address,"CANADA")>0 THEN countyname_1='USA and Canada';
if index(address,"TORONTO")>0 THEN countyname_1='USA and Canada';
if index(address,"BELFAST")>0 THEN countyname_1='Northern Ireland';
if index(address,"DERRY")>0 THEN countyname_1='Northern Ireland';
if index(address,"NORTHERN IRELAND")>0 THEN countyname_1='Northern Ireland';
if index(address,"ANTRIM")>0 THEN countyname_1='Northern Ireland';
if index(address,"ARMAGH")>0 THEN countyname_1='Northern Ireland';
if index(address,"TYRONE")>0 THEN countyname_1='Northern Ireland';
if index(address,"AUSTRALIA")>0 THEN countyname_1='Australia and New Zealand';
if index(address,"ZEALAND")>0 THEN countyname_1='Australia and New Zealand';
if index(address,"QUEENSLAND")>0 THEN countyname_1='Australia and New Zealand';
if index(address,"FRANCE")>0 THEN countyname_1='Mainland Europe';
if index(address,"SPAIN")>0 THEN countyname_1='Mainland Europe';
if index(address,"GERMANY")>0 THEN countyname_1='Mainland Europe';


run;

The problem is that certain addresses have maybe 'York' and 'new York'. Is there a way to use the index function to read a string from right to left, as usually the country is written in the last area?

Also, is there a more efficient way to search in strings? Originally I was going to write all my uk regions into a macro and search for them seperarly. I'm not sure the best way to search though. Is it with the index option?

proc sql;

select name into :UK_LIST separated by '|' 
from test;
quit;

ballardw · Posted 05-23-2019 10:20 AM

Countyname not Countryname?

One might suspect that perhaps your data was read into SAS improperly such that you can't distinguish fields. Can you provide an example of the file that you read to create your Test set? Or examples of what your ADDRESS variable actually looks like?

Cities are an extremely poor approach as many major, and some not so major, cities in UK, France and even Spain and Germany have equivalents in the US, sometimes more than one. I believe that Australian likely has some of the same names as UK as well.

For related text such as your "New York" versus "York" issue you can do a hierarchical search with If / then/ else such as

if index(address,"NEW YORK")>0 THEN countyname_1='USA and Canada';
else if index(address,"YORK")>0 THEN countyname_1='UK';

33pedro · Posted 05-23-2019 10:29 AM

It is possible to provide an extract of the dataset you have?

I think it might be preferrable to have a seperate variable for number, street name, city, region, country etc - that would make searching and creating new variables more straightforward.

At the moment I would be concerned that there is too much potential overlap between lines like

if index(address,"GERMANY")>0 THEN countyname_1='Mainland Europe';

and

if index(address,"NY")>0 THEN countyname_1='USA and Canada';

if they were ordered as above (i know they arent in your example) instances of 'Germany' could be allocated to 'USA and Canada'. Also is it possible for these strings to also appear in street names and so generate false results?

Maybe the data could be read in differently if the structure of the datasets does indeed turn out to be an issue?

Sean_OConnor · Posted 05-24-2019 06:16 AM

Folks,

I have been think through this and I think I have a potential working solution. However, I'm having trouble with my code.

Basically I want to write all my keywords into an array and search my address variable for each instance. Here is my attempt at writing code which I know doesn't work.

data addressstrings;
format location $40.;
set not_assinged(obs=10);
array x (*) ("DUBLIN","BELFAST","LONDON");
maxloc=0;
do i=1 to length(x);
if index(address,x{i})>maxloc do;
location=x{i};
maxloc=index(address,x{i});
end;
run;

So I'm going to specify a list of keywords in an array.

I then want SAS to cycle through the address string. If it finds "Dublin" it writes "Dublin" to the location variable. However, it cycles through again looking at the string for "Belfast". If it finds "Belfast" it overwrites the variable location from Dublin to Belfast if the maxloc is greater (if it is further to the right in the string). This carries on until it finishes looking through the array object.

The assumption here is that people write the largest geographic unit at the end of an address string so this should hopefully pick this up.

I'm aware of other issues around misspellings etc. but I'm hoping this should get me some of the way there.

Would anyone be able to help me in writing a piece of code that actually works, please?

33pedro · Posted 05-24-2019 06:58 AM

Assuming the seperators for the different parts of the address are consistent between observations then I would be tempted to use the scan function to write the last 'word' out to a new variable and then work with that. There does seem to be a lot of assumptions here however about likeness of the data between observations that would concern me.

Sean_OConnor · Posted 05-24-2019 09:57 AM

Would you know how I could amend my code to do this?

33pedro · Posted 05-24-2019 10:42 AM

Well assuming the delimiters that seperates the words in each address are consistent then

data address;
input add $ 1-50;
datalines;
1 top road london england
2 low drive perth scotland
3 high street perth australia
;
run;

data area;
set address;
city = scan(add,-2);
run;

However it does rely on a lot of IFs and BUTs in terms of your dataset as already pointed out, hope it helps though.

Search address string to assign country code