BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JSifontes
Fluorite | Level 6

Hey guys, 

 

I am trying to figure out the best approach for this and what I need to do is the following:

 

I have 2 data sets:

Set1: Countries

Country_Code, Country_Name

 

Set2: Address FIle

Customer_Name, Address_1 through Address_5

 

The address file is free form text, and what I want to do is loop over the countries set and if it finds a matching country from the Country_Name column in Set1 within Address_1-5 of Set2  then return the Country_Code in a new Column on Set2

I just have no clue how to formulate that loop or what the best approach would be

All suggestions welcome!

1 ACCEPTED SOLUTION

Accepted Solutions
TomKari
Onyx | Level 15

I agree completely with @ballardw. Here's a quick mock-up that demonstrates the problem:

 

data Codes;
	length CountryName $20 CountryCode 8;
	input CountryName & CountryCode;
	cards;
Brazil  1
Canada  2
Mexico  3
United States  4
run;

data Addresses;
	infile cards missover;
	length Name Address1 Address2 Address3 Address4 Address5 $20;
	input Name & Address1 & Address2 & Address3 & Address4 & Address5;
	cards;
Name 1  123 Some Street  Detroit, MI  United States
Name 2  RR 123  Out of the way county  Carleton  Ontario  Canada
Name 3  Problem address  456 Canada Lane  Florida  United States
Name 4  Another address  789 Some Street  Puerto Vallarta
Name 5  9876 Mexico Street  Washington County  Delaware
run;

data Addresses2;
	set Addresses;
	length ComboAddr $1000;
	ComboAddr = catx(" ", Address1, Address2, Address3, Address4, Address5);
run;

proc sql noprint;
	create table Addresses3 as
		select Name, ComboAddr, CountryName, CountryCode
			from Addresses2 a left outer join Codes c on(find(ComboAddr, CountryName, "it") > 1) order by Name;
quit;

 

Tom

View solution in original post

7 REPLIES 7
ballardw
Super User

I think you want to strongly consider your requirements.

The approach you suggest, besides being more than a tad awkward is entirely too likely to return false results because names of countries are known to appear in street addresses and city names.

 

It might help to provide an example of your two data sets though, and show what you expect to be the results of the process using them. SMALL examples as you have to provide the example output.

 

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

TomKari
Onyx | Level 15

I agree completely with @ballardw. Here's a quick mock-up that demonstrates the problem:

 

data Codes;
	length CountryName $20 CountryCode 8;
	input CountryName & CountryCode;
	cards;
Brazil  1
Canada  2
Mexico  3
United States  4
run;

data Addresses;
	infile cards missover;
	length Name Address1 Address2 Address3 Address4 Address5 $20;
	input Name & Address1 & Address2 & Address3 & Address4 & Address5;
	cards;
Name 1  123 Some Street  Detroit, MI  United States
Name 2  RR 123  Out of the way county  Carleton  Ontario  Canada
Name 3  Problem address  456 Canada Lane  Florida  United States
Name 4  Another address  789 Some Street  Puerto Vallarta
Name 5  9876 Mexico Street  Washington County  Delaware
run;

data Addresses2;
	set Addresses;
	length ComboAddr $1000;
	ComboAddr = catx(" ", Address1, Address2, Address3, Address4, Address5);
run;

proc sql noprint;
	create table Addresses3 as
		select Name, ComboAddr, CountryName, CountryCode
			from Addresses2 a left outer join Codes c on(find(ComboAddr, CountryName, "it") > 1) order by Name;
quit;

 

Tom

JSifontes
Fluorite | Level 6

Agree with you both, valid points, the issue is that the address fields allow the users to input the country in any of the 5 fields, it all depends on the size of the address so the country could be in any of the 5 fields. I know it won't be super accurate in that regard when it has a country within the address as shown below, but I have to tackle the problem for the majority of cases I suppose 80/20 rule. 

mkeintz
PROC Star

@JSifontes wrote:

Agree with you both, valid points, the issue is that the address fields allow the users to input the country in any of the 5 fields, it all depends on the size of the address so the country could be in any of the 5 fields.

Is it really any of the 5 fields, or would country always be the last non-blank field?  And does every record have a country?


 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
JSifontes
Fluorite | Level 6
It's the last non-blank usually, it's just that it could be in any of the 5 fields depending on the length of the address
JSifontes
Fluorite | Level 6
And yes every record has a country
TomKari
Onyx | Level 15

If every record has a country, it gets a little easier. You can just set aside the entries with more than one apparent country for manual review.

 

Tom

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 985 views
  • 1 like
  • 4 in conversation