BookmarkSubscribeRSS Feed
mspak
Quartz | Level 8

Hello,

I am searching for the most efficient way to calculate the closest distance to top 50 cities in US. I have all the zipcodes for all the firms (I refer this as x) and top cities (referred as y). If I were to arrange all the y in columes (which requires 50 columes), it will take a long time to compute the results. Then, I will have to get the lowest values among all the distances.

Is there any BETTER way to compute the closest distance between the firm and their respective closest city (among top 50 cities)?

Thank you.

Sandy

21 REPLIES 21
Rick_SAS
SAS Super FREQ

I hope you have SAS 9.2 or better.

You can use the ZIPCITYDISTANCE function to compute the distances: http://support.sas.com/kb/36/091.html

It shouldn't take too long: for each firm, find the minimum of the distances to the 50 citites. You can do this in a DATA step.

mspak
Quartz | Level 8

Thanks Rick Wicklin,

I am a user of SAS9.3. I found ways to calculate the distance between only TWO locations. I have a database for the geographic coordinates for all the zipcodes. I have to calculate the distances between the firms with all the top 50 cities, and then identify the closest top city (with distance) of the particular firms. I understand that there is a data step to select the lowest value.

This does mean that I have to present my longitudes and latitides in 102 columns [(50 cities + the firm) x 2]. Note: Each coordinates to be presented in latitude and longitude in pairs. 

My question is that:

Is there anyway to calculate distance between the firm with all the cities by merely creating one formula, instead of repeating the process for 50 times for each firm, preferably with an option to select the lowest value (shortest distance) among distances between all the 50 cities and the firm.

Thank you.

Regards,

Sandy

GraphGuy
Meteorite | Level 14

You wouldn't want to have 50 separate variables for the 50 cities, but rather make the cities 'values' of the variables. Here's a small sample that demonstrates basically what you want to do (since I don't have your cities list, I'm just grabbing some zipcodes out of the sashelp.zipcode file)...

%let myzip=27513;

data cities (keep=statecode city zip);
set sashelp.zipcode (where=(statecode='NC'));
run;

data cities; set cities;
distance=zipcitydistance(&myzip,zip);
run;

proc sort data=cities out=cities;
by distance;
run;

data cities; set cities (obs=50);
run;

Rick_SAS
SAS Super FREQ

Use this example that I created:

data Targets;

input Target @@; /* NYC CHICAGO DALLAS */

datalines;

10027 60601 75211

;

data Cities;

input City @@;

datalines;

96106 49039 01341 13464 40206

15281 31515 67859 64465 80234

25672 45340 26266 77641 97306

92885 57580 36312 95070 29940

25165 16424 39423 76528 06611

;

proc sql; /* combine the two tables */

create table All as

   select * from Targets, Cities;

quit;

data dist;

set All;

dist = zipcitydistance(Target, City);

run;

art297
Opal | Level 21

My suggestion is similar to Robert's but, if you need to calculate the proximities between all of the cities, I would use proc sql.  E.g., the following (I only used a 5 city database, but it could just as well as contained 50 or more) does all of the calculations in one simple step:

data have;

  informat city $25.;

  input city & zip;

  cards;

Chicago  60626

Albany  12209

Los Angeles  90002

New York  10278

Detroit  48243

San Francisco  94104

;

proc sql;

  create table want as

    select *,

           zipcitydistance(zip,zip2) as distance

      from have,have (rename=(city=city2 zip=zip2))

        where city ne city2

          order by city, distance

  ;

quit;

Darrell_sas
SAS Employee

These seem like reasonable suggestions.   The only other suggestion I have is that you could attempt to subset your comparisons by making intelligent choices of cities to compare against.  You have the ZIP Codes and the numbers do have meaning.  The first digit indicates the region of the country.  2, for example, is DC, Maryland, Virginia, West Virginia, NC, and SC.  You might compare zipcodes starting with '2' to the cities in this region and perhaps the next region or so.  But skip the regions that are all the way on the west coast.  However, I'd probably only do this if you have a very large number of firms/ZIP Codes to compare, are having performance issues and want to try additional tweaks.

This link gives some information on the ZIP Code regions:

http://www.zippostalcodes.com/postcodes/us/us-zip-codes-format/

ballardw
Super User

I think at this point that MSPAK needs to provide an example of 4 or 5 cities, firms and what the output should look like and definition of "top city".

mspak
Quartz | Level 8

Thanks for the helps!!!

I herewith attached the data as example.

The following SAS files attached:

File 1: usa1

This file contains the company ticker no. (tic), financial year (fyear), address zipcodes (ADDZIP) and so on.

File2: metro_rank

This file contains zipcodes with their latitudes, longitudes and the last column is the ranking for the metropolitans. I used this rank for the top cities.

My main task:

I have to calculate the distance between the firms (in file 1) with the closest top 50 cities (in file 2). At this point, I still do not have any idea which city is the closest to the firm.

My expected output:

I wish to have an additional variable (Column) which included in File 1, namely, the distance (in miles) to the closest city for each firm (which identified by tic).

I think I have to make my point clearer at this point.

Once again, thank you very much for helps.

Regards,

MSPAK

art297
Opal | Level 21

I can only help you through part of your problem.  Your USA1 file contains a mixture of zipcodes and postal codes from both Canada and Europe.

The following will work to solve your problem for just the records that have US zipcodes.  However, if you need the match the Canadian and European cities with US cities, you'll have to obtain the grid coordinates for all of the cities and do some modifications to the code.  However, the approach should still work, and it only took about 2 minutes to accomplish on a fairly old Windows machine:

libname thedata "c:\";

data part_usa1 (keep=tic partzip city);

  length partzip $5;

  set thedata.usa1;

  partzip=strip(substr(addzip,1,5));

  if length(partzip) eq 5 and

   notdigit(partzip) eq 0;

run;

proc sort data=part_usa1 nodupkey;

  by partzip;

run;

proc sql;

  create table temp (keep=tic bcity distance sum_rank) as

    select *,

           zipcitydistance(a.partzip,b.zipcode) as distance

      from part_usa1 as a,

           thedata.Metro_rank (keep=ZipCode City sum_rank

                               rename=(City=bCity)) as b

        where city ne bCity and not missing(calculated distance)

          order by tic, distance

  ;

quit;

data need (drop=bCity sum_rank distance);

  set temp;

  array cities(50) $35.;

  array ranks(50);

  array distances(50);

  retain cities: ranks: distances:;

  by tic;

  if first.tic then n=1;

  else n+1;

  if n le 50 then do;

    cities(n)=bCity;

    ranks(n)=sum_rank;

    distances(n)=distance;

  end;

  if n eq 50 then output;

run;

proc sort data=thedata.usa1 out=usa1;

  by tic;

run;

data want;

  merge need usa1;

  by tic;

run;

mspak
Quartz | Level 8

Thanks for suggestions,

I will eventually delete all those firms are not in US as my research sample should only include US firms. My another question is:

Should I convert all the longitude and latitude (which are currently in decimals) into radians before your program suggested?

Thanks again.

art297
Opal | Level 21

The code I suggested didn't use the grid coordinates at all but, rather, only used the zip codes to determine distance.  That is why it had to omit the non-US entries.

art297
Opal | Level 21

Upon looking at Robert's code I realized that I didn't provide any comments or explanations regarding my own code. Hey, at 1:15am, I couldn't stay awake any longer.

The code I proposed results in a file, called work.want, the contains all of the records from USA1, with the addition of 150 fields on each record.

cities1 is the city closest to the firm and ranks1 is its rank and distances1 shows the number of miles it is from the firm.  The second closest is cities2, etc. and the 50th closest is cities50.

GraphGuy
Meteorite | Level 14

Sounds like you've got a bit of data-cleaning to do, but ...

Using the data you provided, and assuming that this is the problem statement...

For each company (designated by a tic), find the closest city which is in the "top 50" cities.

Here is some code.  I've added some comments to let you know what's going on at each step along the way, and of course each step could be done in several different ways (depending on personal preference, the need for efficiency-vs-ease, etc):

libname mydata 'somewhere';

/* only keep 1 obsn per tic (don't need all the fiscal years */
/* 12,913 obsns */
proc sql;
create table usa1 as
select unique tic, addzip
from mydata.usa1;
quit; run;

/* create 'zip' variable with 5-digit numeric zip code */
/* also do something to also limit these to US-only */
data usa1; set usa1;
format zip z5.;
zip=.; zip=substr(addzip,1,5);
if zip^=. then output;
run;

/* only keep cities in the "top 50" */
data metro_rank; set mydata.metro_rank;
metro_zip=.; metro_zip=substr(zipcode,1,5);
if sum_rank^=. and sum_rank<=50 and metro_zip^=. then output;
run;

/* get every possible pair of company-to-city */
/* this produces about 607,215 obsns */
proc sql;
create table pairs as
select
usa1.tic, usa1.zip,
metro_rank.metro_zip, metro_rank.city, metro_rank.sum_rank
from usa1, metro_rank
where usa1.zip^=metro_rank.metro_zip;
quit; run;

/* calculate the distance between each tic's zip, and the metro_zips */
/* this takes about 11 seconds on my laptop */
data pairs; set pairs;
distance=zipcitydistance(zip,metro_zip);
if distance^=. then output;
run;

/* For each tic, get the obsn with the shortest distance */

/* produces 11,680 obsns */

proc sql;

create table closest as

select unique *

from pairs

group by tic

having sum_rank = min(sum_rank);

quit; run;

mspak
Quartz | Level 8

Thanks for all useful suggestions,

Based on the article entitled "Driving distance and Time using SAS and Google Maps" (refer:http://support.sas.com/resources/papers/proceedings10/050-2010.pdf), I find that I can calculate the driving distance between 2 points. However, the examples provided in the article is a small sample,

If I would like to calculate the DRIVE TIME between the firm (identified by tic) and the closest top 50 cities. Note: Using the SAS data files I uploaded. 

What would be the solution? Is it possible to handle this task efficiently?

Thank you.

Regards,

MEI SEN

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 21 replies
  • 3933 views
  • 6 likes
  • 7 in conversation