BookmarkSubscribeRSS Feed
KyleM_Corrie
Fluorite | Level 6

I'm trying to find a method of comparison for distance between particular locations that outputs only those observations that meet specific criteria.

 

Take, for example, the following data set,

 

data HAVE;
  input Store Lat Lon;
  datalines;
  1164 42.30861 -83.44031
  1251 42.32282 -83.45289
  1927 41.67517 -87.7969
  6523 43.81187 -97.72712
  ;
run;

where Store assigns an identifier code to a given store, Lat identifies that store's geographical latitude, and Lon identifies the stores geographical longitude.

 

What I'd like to do is compare each store against every other store, calculate the distance between them using the GEODIST function, and then output only those stores which are within three miles of each other and not equal to zero.

 

comparisons.PNG

 

So, Store 1164 gets compared to Store 1164 and the GEODIST is calculated as 0.

Store 1164 gets compared to 1251 and the GEODIST is calculated as 1.17 miles.

Store 1164 gets compared to 1927 and the GEODIST is calculated as 228.55 miles.

Store 1164 gets compared to 6523 and the GEODIST is calculated as 729.79 miles.

 

Therefore, the only comparison which is output is the 1164 to 1251 comparison.

 

Then the process repeats and Store 1251 is compared to all other stores.

 

What I don't want to do is use multiple steps, create a Cartesian product using, another data set, and apply my logic qualifiers.

 

I was hoping someone far more clever than I was out who had a more efficient method.

 

I was wondering if it might be possible to put the HAVE data set in a SET statement, then use a DO loop, set the HAVE data set again, and calculate all distances then combine this with an OUTPUT statement to do everything in one go.

 

I'd appreciate any suggestions!

 

Thanks

 

P.S. - The GEODIST function is as follows for anyone who is not familiar:

 

=GEODIST(latitude1,longitude1,latitude2,longitude2, 'M')

5 REPLIES 5
Reeza
Super User

By definition what you're doing is a Cartesian product. 

KyleM_Corrie
Fluorite | Level 6

I'd like to find a method to accomplish this task without outputting a new data set and instead, hopefully, evaluate these conditions in the PDV.

 

Do you have any suggestions, or were you just looking for an opportunity to increase your post count?

Reeza
Super User

@KyleM_Corrie wrote:

 

Do you have any suggestions, or were you just looking for an opportunity to increase your post count?


 

Is there a reward for post count that I missed?!?

if you can define a problem your closer to solving it - ie can google efficient Cartesian product. 

 

The last time I dealt with this issue Using a double set with key was the most efficient solution. 

A hash solution may also be very efficient. Both of these would output a new data set though. 

KyleM_Corrie
Fluorite | Level 6

Thanks for the suggestions. I only mentioned the post count because I didn't see what other purpose your original post could have served.

 

The method I described was only mentioned to describe what it was I hoped to achieve. It has been the only way I can figure out how to make it work, but I'd like to imagine there are much better methods.

 

Currently, it requires me to import data that is stored in an XLSX file, create a new data set using PROC SQL which results in a Cartesian product, and then evaluate my conditions on the new data set in a different data step.

 

It's not pretty, but it gets the job done.

 

I was just curious if one of you guys had any way to make it pretty as well!

 

Again, thanks much for the suggestions. It could be that what I want to do just can't be done in the manner I'd like it to be done.

Reeza
Super User
Use The previous SQL query I provided with a WHERE clause (where dist<3). No extra data step, but not super efficient.

I used to do this in SAS but have since moved towards using GIS tools for these types of analysis. Theyre optimized, and its usually available via GUI. QGIS is free and open source. It used to be okay to approximate with geodist but nowadays driving distance and travel time are more important.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 1924 views
  • 0 likes
  • 2 in conversation