Geospatial analysis is one of my favourite types of analysis -- being able to take potentially complex, messy, and big data and see how it fits onto a map is just so cool. Whether it’s looking at people who have come down with the flu, broken watermains, or traffic accidents, seeing it on a map can be much more effective that seeing it on a “regular” graph.
Unfortunately, SAS University Edition does not come with the powerful geographic analytical tools that are standard in Base SAS, nor does University Edition have the ability to create the maps that can be done. Having said that, there are some cool things you can do with some creativity, patience, and a bunch of SQL.
Get the Data
I’m using a crowdsourced dataset on campgrounds – I used to love camping as a kid, and was really excited to see this data. The original dataset is multiple CSV files, which I merged, cleaned (I didn’t need all the variables, and there was 1 duplicate record) and imported into SAS University Edition. The data can be downloaded here, and I recommend you copy / paste the list of short forms from the site into your code for easy reference.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
Other than merging the data into a single Excel file, there wasn’t much I had to do. Before I imported it, I used the Excel “delete duplicates” tool (which removed the 1 record I mentioned above) and then imported it into the University Edition environment.
So one of the key questions you can answer using geospatial analysis is the entity (campground, library, school, etc.) that is closest to another entity (your location, all the other campgrounds / libraries / schools, etc.).
There are a few different methods to determining distance between two places; refer to Calculating Geographic Distance: Concepts and Methods by Frank Ivis to see detailed examples of the more common ones. I have typically used “straight line” distance, as it’s the least computer intensive and does provide a decent approximation. I have always had the most success
using the Haversine formula, as the distances are always closest to what I get using mapping software. This is the method I use below.
Now that we have our campground data, I want to see what the closest campgrounds are that meet specific parameters. Let’s take a look at the code and break it down.
proc sql; select /* -43.670, -79.386 */ /* 1 */ /* (note – the (constant(‘pi’)
is the way you use Pi in SAS, and works for e, i and other constants you may need */ 6371*(2 * arsin(min(1,sqrt(SIN( ((LAT-43.670)*(constant('pi')/180))/2)**2 + cos(43.670*(constant('pi')/180)) * cos(lat*(constant('pi')/180)) * sin ( ((lon-(-79.386))*(constant('pi')/180)/2)**2))))) as Dist_KM, /* 2 */ 3959*(2 * arsin(min(1,sqrt(SIN( ((LAT-43.670)*(constant('pi')/180))/2)**2 + cos(43.670*(constant('pi')/180)) * cos(lat*(constant('pi')/180)) * sin ( ((lon-(-79.386))*(constant('pi')/180)/2)**2))))) as Dist_Miles, /* 3 */ name, prov_state, town from work.IMPORT /* 4 */ where amenities like '%FT%' and amenities like '%NP%'; quit;
So I am Canadian, born, raised and living in Toronto for most of my life. So in my example, I’m using the main intersection in Toronto (Yonge, pronounced young, and Bloor) as my starting point. I add the latitude and longitude in my comments at the very top so I can copy / paste them if needed.
In the first section, I am using the Haversine formula to calculate the distance in kilometers. You’ll see I have my latitude and longitude as part of the calculation; if you’re running this against two datasets, just replace the numbers with the names of your variables. I took the original code from Ivis’ paper and modified it. In his paper, he uses the data step, and runs each part separately. I prefer SQL and validating each step separately, but find it to be more efficient to run the whole thing at once.
The second section is basically the same as the first, with the exception that it’s calculating the distance in miles. You’ll notice that the only difference is the first number – this is the radius of the Earth (in km or miles, depending on what you’re calculating).
The third section is just a couple of variables from my data that I want included in my output, and the final section are my parameters. If you refer to the legend on the site, you’ll see I’m looking for all campgrounds with flushing toilets and no pets allowed.
When I run the code, I get exactly two results. Because the data is crowdsourced, it is not 100% complete so there may be plenty more campgrounds that meet our parameters, but these are the ones in our data.
Depending on your requirements, you can tweak the SQL (where amenities like '%FT%'
OR amenities like '%NP%'; where amenities NOT like '%FT%' and amenities NOT like '%NP%'; etc.) – play around and see what you can find!
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.