BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JSifontes
Fluorite | Level 6

Hey guys, 

 

I am trying to figure out the best approach for this and what I need to do is the following:

 

I have 2 data sets:

Set1: Countries

Country_Code, Country_Name

 

Set2: Address FIle

Customer_Name, Address_1 through Address_5

 

The address file is free form text, and what I want to do is loop over the countries set and if it finds a matching country from the Country_Name column in Set1 within Address_1-5 of Set2  then return the Country_Code in a new Column on Set2

I just have no clue how to formulate that loop or what the best approach would be

All suggestions welcome!

1 ACCEPTED SOLUTION

Accepted Solutions
TomKari
Onyx | Level 15

I agree completely with @ballardw. Here's a quick mock-up that demonstrates the problem:

 

data Codes;
	length CountryName $20 CountryCode 8;
	input CountryName & CountryCode;
	cards;
Brazil  1
Canada  2
Mexico  3
United States  4
run;

data Addresses;
	infile cards missover;
	length Name Address1 Address2 Address3 Address4 Address5 $20;
	input Name & Address1 & Address2 & Address3 & Address4 & Address5;
	cards;
Name 1  123 Some Street  Detroit, MI  United States
Name 2  RR 123  Out of the way county  Carleton  Ontario  Canada
Name 3  Problem address  456 Canada Lane  Florida  United States
Name 4  Another address  789 Some Street  Puerto Vallarta
Name 5  9876 Mexico Street  Washington County  Delaware
run;

data Addresses2;
	set Addresses;
	length ComboAddr $1000;
	ComboAddr = catx(" ", Address1, Address2, Address3, Address4, Address5);
run;

proc sql noprint;
	create table Addresses3 as
		select Name, ComboAddr, CountryName, CountryCode
			from Addresses2 a left outer join Codes c on(find(ComboAddr, CountryName, "it") > 1) order by Name;
quit;

 

Tom

View solution in original post

7 REPLIES 7
ballardw
Super User

I think you want to strongly consider your requirements.

The approach you suggest, besides being more than a tad awkward is entirely too likely to return false results because names of countries are known to appear in street addresses and city names.

 

It might help to provide an example of your two data sets though, and show what you expect to be the results of the process using them. SMALL examples as you have to provide the example output.

 

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

TomKari
Onyx | Level 15

I agree completely with @ballardw. Here's a quick mock-up that demonstrates the problem:

 

data Codes;
	length CountryName $20 CountryCode 8;
	input CountryName & CountryCode;
	cards;
Brazil  1
Canada  2
Mexico  3
United States  4
run;

data Addresses;
	infile cards missover;
	length Name Address1 Address2 Address3 Address4 Address5 $20;
	input Name & Address1 & Address2 & Address3 & Address4 & Address5;
	cards;
Name 1  123 Some Street  Detroit, MI  United States
Name 2  RR 123  Out of the way county  Carleton  Ontario  Canada
Name 3  Problem address  456 Canada Lane  Florida  United States
Name 4  Another address  789 Some Street  Puerto Vallarta
Name 5  9876 Mexico Street  Washington County  Delaware
run;

data Addresses2;
	set Addresses;
	length ComboAddr $1000;
	ComboAddr = catx(" ", Address1, Address2, Address3, Address4, Address5);
run;

proc sql noprint;
	create table Addresses3 as
		select Name, ComboAddr, CountryName, CountryCode
			from Addresses2 a left outer join Codes c on(find(ComboAddr, CountryName, "it") > 1) order by Name;
quit;

 

Tom

JSifontes
Fluorite | Level 6

Agree with you both, valid points, the issue is that the address fields allow the users to input the country in any of the 5 fields, it all depends on the size of the address so the country could be in any of the 5 fields. I know it won't be super accurate in that regard when it has a country within the address as shown below, but I have to tackle the problem for the majority of cases I suppose 80/20 rule. 

mkeintz
PROC Star

@JSifontes wrote:

Agree with you both, valid points, the issue is that the address fields allow the users to input the country in any of the 5 fields, it all depends on the size of the address so the country could be in any of the 5 fields.

Is it really any of the 5 fields, or would country always be the last non-blank field?  And does every record have a country?


 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
JSifontes
Fluorite | Level 6
It's the last non-blank usually, it's just that it could be in any of the 5 fields depending on the length of the address
JSifontes
Fluorite | Level 6
And yes every record has a country
TomKari
Onyx | Level 15

If every record has a country, it gets a little easier. You can just set aside the entries with more than one apparent country for manual review.

 

Tom

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

Creating Custom Steps in SAS Studio

Check out this tutorial series to learn how to build your own steps in SAS Studio.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 2193 views
  • 1 like
  • 4 in conversation