I have an address column in my dataset that I wanted to parse to put city and zip code in their separate columns. Positions differ in many of the addresses. Examples of address types in the data are:
1. Address: 321 E 12th St Des Moines Iowa, 50321phone Number: (515) 515-5151
2. Address: 321 Iowa st Boone Iowa, 50321Phone Number: (515) 515-5151
3. Address: 321 Des Moines st Suite 140 Iowa City Iowa, 50321Phone Number: (515) 515-5151
Thanks
It's nearly impossible for us to create a set of rules based on your three examples that will likely work on a wide variety of real-life examples.
Could you state the rules to find city and Zip Code that you think will work given the real data that you have?
I do not really have any rules in mind. And these examples are real data.
Yes, but you can come up with rules by looking at the real data (which we can't do) and then tell us what those rules are that you think will work, and then we can help you program them.
So all I really needed from this address column is the City name and Zip code.
1. The city name can be one or two words, followed by State name and then a comma ","
2. Zip codes are only 5 numbers
3. Some address are missing everything except the state name
4. There are addresses from other states, but only interested in those that has Iowa as the state
Also, If zip codes is not blank, it is right after a comma
@alagiejatta
Parsing addresses is difficult, there's a good reply in a thread on Stack Overflow:
https://stackoverflow.com/questions/11160192/how-to-parse-freeform-street-postal-address-out-of-text...
At the end of the day, you will need to come up with some rules for your situation/data.
As @PaigeMiller states, you need to review your data and come up with rules that work for your data.
Looking at your 3 sample lines, you can strip out the phone number, and probably get the zip, as long as nobody has long zip format (50321-1234). Getting the state/city that's going to be problematic as per the thread linked above.
Sorry. City can actually be up to 3 words.
Thanks
Is Iowa always spelled out, all four letters, or is it sometimes abbreviated as IA?
Are there every 9 digit zip codes?
Are there separators between address and city name other than a blank?
Again, there are lots of possibilities and you have addressed a few, but I'm sure there are many complications that you haven't addressed yet.
Iowa is always spelled out in full (including other states)
There are no 9 digit zip. All are 5 digits
There are only blank separators between address and city name and between most of the elements. The only comma separator is after the state name. Column (:) separators are after the text " Address" and after the text "Phone Number"
Thank you
I hope you get some idea of how difficult this is in general (and I hope you read the link from @AMSAS )
In your first record, there seems to be no way to decide if the city name is "Moines" or "Des Moines" or "St Des Moines".
Some people use a list of place names to help them decide, especially if you have a zip code. That isn't specifically pulling apart the text string, as extra information is used. I don't have skills with such a procedure, but I'm sure someone here does.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.