I have a list of city names in the U.S. They are typed in by end-users. A lot of them contains minor typos. See below
Is there an internal list of U.S. city names in the SAS that I can compare my city name list to the correct list (maybe by using COMPLEV???) in order to minimize distance and then
standardize city names. Alternatively, is there a function that transform the first column into code or standard city names so that I can do proc geocode on them?
I tried to manually correct these, almost killed myself...
Thanks a lot, guys!
Chicago, | Chicago |
Chicao | Chicago |
Chichago | Chicago |
Cincinatti | CINCINNATI |
Cincinnati | CINCINNATI |
Cincinnatti | CINCINNATI |
Cinncinati | CINCINNATI |
There's the maps.uscity table that has City Names, and lat/long.
But city names aren't unique so you need to be careful,
proc freq data=maps.uscity noprint;
table city/out=city_names;
run;
proc sort data=city_names; by descending count ;
run;
Thank you for your help -
Another consideration: Many city names are similar. For example, if you see the misspelling "Parris", there are many legitimate city names that could match it: Paris, Perris, Parrish. Would you like a computer to make the decision for you?
One viable approach would be to take all the corrections that you so studiously made already and automate them. Apply the same changes you made to the next batch of data. Then see which spellings still don't match a legitimate city and augment your list.
Good luck.
There should be a ZIP code data set SASHelp.Zipcode that has city and state information including some "alternate" city names associated. If by any chance your data also contains zip codes the function ZIPCity may be of interest to you.
Good luck.
Unfortunately, the zip code variable is not available. But I also have state variable I can use to triangulate. Could I ask in this case, what would be a better solution? Thank you, guys -
My general approach would be to find the matches by state on one or more of the city variables in the ZIP code data set first. For the ones that don't match I would start by examining the difference between the (distinct) cities in the data with ZIP code set looking for the minimum COMPGED or COMPLEV and/ or maximimum COMPARE function results.
That is a good method also, thanks!
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.