02-01-2012 12:01 AM
I am searching for the most efficient way to calculate the closest distance to top 50 cities in US. I have all the zipcodes for all the firms (I refer this as x) and top cities (referred as y). If I were to arrange all the y in columes (which requires 50 columes), it will take a long time to compute the results. Then, I will have to get the lowest values among all the distances.
Is there any BETTER way to compute the closest distance between the firm and their respective closest city (among top 50 cities)?
02-01-2012 06:26 AM
I hope you have SAS 9.2 or better.
You can use the ZIPCITYDISTANCE function to compute the distances: http://support.sas.com/kb/36/091.html
It shouldn't take too long: for each firm, find the minimum of the distances to the 50 citites. You can do this in a DATA step.
02-01-2012 07:57 AM
Thanks Rick Wicklin,
I am a user of SAS9.3. I found ways to calculate the distance between only TWO locations. I have a database for the geographic coordinates for all the zipcodes. I have to calculate the distances between the firms with all the top 50 cities, and then identify the closest top city (with distance) of the particular firms. I understand that there is a data step to select the lowest value.
This does mean that I have to present my longitudes and latitides in 102 columns [(50 cities + the firm) x 2]. Note: Each coordinates to be presented in latitude and longitude in pairs.
My question is that:
Is there anyway to calculate distance between the firm with all the cities by merely creating one formula, instead of repeating the process for 50 times for each firm, preferably with an option to select the lowest value (shortest distance) among distances between all the 50 cities and the firm.
02-01-2012 08:15 AM
You wouldn't want to have 50 separate variables for the 50 cities, but rather make the cities 'values' of the variables. Here's a small sample that demonstrates basically what you want to do (since I don't have your cities list, I'm just grabbing some zipcodes out of the sashelp.zipcode file)...
data cities (keep=statecode city zip);
set sashelp.zipcode (where=(statecode='NC'));
data cities; set cities;
proc sort data=cities out=cities;
data cities; set cities (obs=50);
02-01-2012 08:24 AM
Use this example that I created:
input Target @@; /* NYC CHICAGO DALLAS */
10027 60601 75211
input City @@;
96106 49039 01341 13464 40206
15281 31515 67859 64465 80234
25672 45340 26266 77641 97306
92885 57580 36312 95070 29940
25165 16424 39423 76528 06611
proc sql; /* combine the two tables */
create table All as
select * from Targets, Cities;
dist = zipcitydistance(Target, City);
02-01-2012 08:26 AM
My suggestion is similar to Robert's but, if you need to calculate the proximities between all of the cities, I would use proc sql. E.g., the following (I only used a 5 city database, but it could just as well as contained 50 or more) does all of the calculations in one simple step:
informat city $25.;
input city & zip;
Los Angeles 90002
New York 10278
San Francisco 94104
create table want as
zipcitydistance(zip,zip2) as distance
from have,have (rename=(city=city2 zip=zip2))
where city ne city2
order by city, distance
02-01-2012 08:56 AM
These seem like reasonable suggestions. The only other suggestion I have is that you could attempt to subset your comparisons by making intelligent choices of cities to compare against. You have the ZIP Codes and the numbers do have meaning. The first digit indicates the region of the country. 2, for example, is DC, Maryland, Virginia, West Virginia, NC, and SC. You might compare zipcodes starting with '2' to the cities in this region and perhaps the next region or so. But skip the regions that are all the way on the west coast. However, I'd probably only do this if you have a very large number of firms/ZIP Codes to compare, are having performance issues and want to try additional tweaks.
This link gives some information on the ZIP Code regions:
02-01-2012 05:57 PM
I think at this point that MSPAK needs to provide an example of 4 or 5 cities, firms and what the output should look like and definition of "top city".
02-01-2012 11:57 PM
Thanks for the helps!!!
I herewith attached the data as example.
The following SAS files attached:
File 1: usa1
This file contains the company ticker no. (tic), financial year (fyear), address zipcodes (ADDZIP) and so on.
This file contains zipcodes with their latitudes, longitudes and the last column is the ranking for the metropolitans. I used this rank for the top cities.
My main task:
I have to calculate the distance between the firms (in file 1) with the closest top 50 cities (in file 2). At this point, I still do not have any idea which city is the closest to the firm.
My expected output:
I wish to have an additional variable (Column) which included in File 1, namely, the distance (in miles) to the closest city for each firm (which identified by tic).
I think I have to make my point clearer at this point.
Once again, thank you very much for helps.
02-02-2012 01:16 AM
I can only help you through part of your problem. Your USA1 file contains a mixture of zipcodes and postal codes from both Canada and Europe.
The following will work to solve your problem for just the records that have US zipcodes. However, if you need the match the Canadian and European cities with US cities, you'll have to obtain the grid coordinates for all of the cities and do some modifications to the code. However, the approach should still work, and it only took about 2 minutes to accomplish on a fairly old Windows machine:
libname thedata "c:\";
data part_usa1 (keep=tic partzip city);
length partzip $5;
if length(partzip) eq 5 and
notdigit(partzip) eq 0;
proc sort data=part_usa1 nodupkey;
create table temp (keep=tic bcity distance sum_rank) as
zipcitydistance(a.partzip,b.zipcode) as distance
from part_usa1 as a,
thedata.Metro_rank (keep=ZipCode City sum_rank
rename=(City=bCity)) as b
where city ne bCity and not missing(calculated distance)
order by tic, distance
data need (drop=bCity sum_rank distance);
array cities(50) $35.;
retain cities: ranks: distances:;
if first.tic then n=1;
if n le 50 then do;
if n eq 50 then output;
proc sort data=thedata.usa1 out=usa1;
merge need usa1;
02-02-2012 01:29 AM
Thanks for suggestions,
I will eventually delete all those firms are not in US as my research sample should only include US firms. My another question is:
Should I convert all the longitude and latitude (which are currently in decimals) into radians before your program suggested?
02-02-2012 08:29 AM
The code I suggested didn't use the grid coordinates at all but, rather, only used the zip codes to determine distance. That is why it had to omit the non-US entries.
02-02-2012 09:52 AM
Upon looking at Robert's code I realized that I didn't provide any comments or explanations regarding my own code. Hey, at 1:15am, I couldn't stay awake any longer.
The code I proposed results in a file, called work.want, the contains all of the records from USA1, with the addition of 150 fields on each record.
cities1 is the city closest to the firm and ranks1 is its rank and distances1 shows the number of miles it is from the firm. The second closest is cities2, etc. and the 50th closest is cities50.
02-02-2012 09:27 AM
Sounds like you've got a bit of data-cleaning to do, but ...
Using the data you provided, and assuming that this is the problem statement...
For each company (designated by a tic), find the closest city which is in the "top 50" cities.
Here is some code. I've added some comments to let you know what's going on at each step along the way, and of course each step could be done in several different ways (depending on personal preference, the need for efficiency-vs-ease, etc):
libname mydata 'somewhere';
/* only keep 1 obsn per tic (don't need all the fiscal years */
/* 12,913 obsns */
create table usa1 as
select unique tic, addzip
/* create 'zip' variable with 5-digit numeric zip code */
/* also do something to also limit these to US-only */
data usa1; set usa1;
format zip z5.;
if zip^=. then output;
/* only keep cities in the "top 50" */
data metro_rank; set mydata.metro_rank;
if sum_rank^=. and sum_rank<=50 and metro_zip^=. then output;
/* get every possible pair of company-to-city */
/* this produces about 607,215 obsns */
create table pairs as
metro_rank.metro_zip, metro_rank.city, metro_rank.sum_rank
from usa1, metro_rank
/* calculate the distance between each tic's zip, and the metro_zips */
/* this takes about 11 seconds on my laptop */
data pairs; set pairs;
if distance^=. then output;
/* For each tic, get the obsn with the shortest distance */
/* produces 11,680 obsns */
create table closest as
select unique *
group by tic
having sum_rank = min(sum_rank);
02-02-2012 10:24 PM
Thanks for all useful suggestions,
Based on the article entitled "Driving distance and Time using SAS and Google Maps" (refer:http://support.sas.com/resources/papers/proceedings10/050-2010.pdf), I find that I can calculate the driving distance between 2 points. However, the examples provided in the article is a small sample,
If I would like to calculate the DRIVE TIME between the firm (identified by tic) and the closest top 50 cities. Note: Using the SAS data files I uploaded.
What would be the solution? Is it possible to handle this task efficiently?