Hey guys,
I am trying to figure out the best approach for this and what I need to do is the following:
I have 2 data sets:
Set1: Countries
Country_Code, Country_Name
Set2: Address FIle
Customer_Name, Address_1 through Address_5
The address file is free form text, and what I want to do is loop over the countries set and if it finds a matching country from the Country_Name column in Set1 within Address_1-5 of Set2 then return the Country_Code in a new Column on Set2
I just have no clue how to formulate that loop or what the best approach would be
All suggestions welcome!
I agree completely with @ballardw. Here's a quick mock-up that demonstrates the problem:
data Codes;
length CountryName $20 CountryCode 8;
input CountryName & CountryCode;
cards;
Brazil 1
Canada 2
Mexico 3
United States 4
run;
data Addresses;
infile cards missover;
length Name Address1 Address2 Address3 Address4 Address5 $20;
input Name & Address1 & Address2 & Address3 & Address4 & Address5;
cards;
Name 1 123 Some Street Detroit, MI United States
Name 2 RR 123 Out of the way county Carleton Ontario Canada
Name 3 Problem address 456 Canada Lane Florida United States
Name 4 Another address 789 Some Street Puerto Vallarta
Name 5 9876 Mexico Street Washington County Delaware
run;
data Addresses2;
set Addresses;
length ComboAddr $1000;
ComboAddr = catx(" ", Address1, Address2, Address3, Address4, Address5);
run;
proc sql noprint;
create table Addresses3 as
select Name, ComboAddr, CountryName, CountryCode
from Addresses2 a left outer join Codes c on(find(ComboAddr, CountryName, "it") > 1) order by Name;
quit;
Tom
I think you want to strongly consider your requirements.
The approach you suggest, besides being more than a tad awkward is entirely too likely to return false results because names of countries are known to appear in street addresses and city names.
It might help to provide an example of your two data sets though, and show what you expect to be the results of the process using them. SMALL examples as you have to provide the example output.
Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.
I agree completely with @ballardw. Here's a quick mock-up that demonstrates the problem:
data Codes;
length CountryName $20 CountryCode 8;
input CountryName & CountryCode;
cards;
Brazil 1
Canada 2
Mexico 3
United States 4
run;
data Addresses;
infile cards missover;
length Name Address1 Address2 Address3 Address4 Address5 $20;
input Name & Address1 & Address2 & Address3 & Address4 & Address5;
cards;
Name 1 123 Some Street Detroit, MI United States
Name 2 RR 123 Out of the way county Carleton Ontario Canada
Name 3 Problem address 456 Canada Lane Florida United States
Name 4 Another address 789 Some Street Puerto Vallarta
Name 5 9876 Mexico Street Washington County Delaware
run;
data Addresses2;
set Addresses;
length ComboAddr $1000;
ComboAddr = catx(" ", Address1, Address2, Address3, Address4, Address5);
run;
proc sql noprint;
create table Addresses3 as
select Name, ComboAddr, CountryName, CountryCode
from Addresses2 a left outer join Codes c on(find(ComboAddr, CountryName, "it") > 1) order by Name;
quit;
Tom
Agree with you both, valid points, the issue is that the address fields allow the users to input the country in any of the 5 fields, it all depends on the size of the address so the country could be in any of the 5 fields. I know it won't be super accurate in that regard when it has a country within the address as shown below, but I have to tackle the problem for the majority of cases I suppose 80/20 rule.
@JSifontes wrote:
Agree with you both, valid points, the issue is that the address fields allow the users to input the country in any of the 5 fields, it all depends on the size of the address so the country could be in any of the 5 fields.
Is it really any of the 5 fields, or would country always be the last non-blank field? And does every record have a country?
If every record has a country, it gets a little easier. You can just set aside the entries with more than one apparent country for manual review.
Tom
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.