Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Geospatial analysis is one of my favourite types of analysis -- being able to take potentially complex, messy, and big data and see how it fits onto a map is just so cool. Whether it’s looking at people who have come down with the flu, broken watermains, or traffic accidents, seeing it on a map can be much more effective that seeing it on a “regular” graph.
Unfortunately, SAS University Edition does not come with the powerful geographic analytical tools that are standard in Base SAS, nor does University Edition have the ability to create the maps that can be done. Having said that, there are some cool things you can do with some creativity, patience, and a bunch of SQL.
Get the Data
I’m using a crowdsourced dataset on campgrounds – I used to love camping as a kid, and was really excited to see this data. The original dataset is multiple CSV files, which I merged, cleaned (I didn’t need all the variables, and there was 1 duplicate record) and imported into SAS University Edition. The data can be downloaded here, and I recommend you copy / paste the list of short forms from the site into your code for easy reference.
Getting the data ready
Other than merging the data into a single Excel file, there wasn’t much I had to do. Before I imported it, I used the Excel “delete duplicates” tool (which removed the 1 record I mentioned above) and then imported it into the University Edition environment.
So one of the key questions you can answer using geospatial analysis is the entity (campground, library, school, etc.) that is closest to another entity (your location, all the other campgrounds / libraries / schools, etc.).
There are a few different methods to determining distance between two places; refer to Calculating Geographic Distance: Concepts and Methods by Frank Ivis to see detailed examples of the more common ones. I have typically used “straight line” distance, as it’s the least computer intensive and does provide a decent approximation. I have always had the most success
using the Haversine formula, as the distances are always closest to what I get using mapping software. This is the method I use below.
Now that we have our campground data, I want to see what the closest campgrounds are that meet specific parameters. Let’s take a look at the code and break it down.
proc sql; select /* -43.670, -79.386 */ /* 1 */ /* (note – the (constant(‘pi’)
is the way you use Pi in SAS, and works for e, i and other constants you may need */ 6371*(2 * arsin(min(1,sqrt(SIN( ((LAT-43.670)*(constant('pi')/180))/2)**2 + cos(43.670*(constant('pi')/180)) * cos(lat*(constant('pi')/180)) * sin ( ((lon-(-79.386))*(constant('pi')/180)/2)**2))))) as Dist_KM, /* 2 */ 3959*(2 * arsin(min(1,sqrt(SIN( ((LAT-43.670)*(constant('pi')/180))/2)**2 + cos(43.670*(constant('pi')/180)) * cos(lat*(constant('pi')/180)) * sin ( ((lon-(-79.386))*(constant('pi')/180)/2)**2))))) as Dist_Miles, /* 3 */ name, prov_state, town from work.IMPORT /* 4 */ where amenities like '%FT%' and amenities like '%NP%'; quit;
So I am Canadian, born, raised and living in Toronto for most of my life. So in my example, I’m using the main intersection in Toronto (Yonge, pronounced young, and Bloor) as my starting point. I add the latitude and longitude in my comments at the very top so I can copy / paste them if needed.
In the first section, I am using the Haversine formula to calculate the distance in kilometers. You’ll see I have my latitude and longitude as part of the calculation; if you’re running this against two datasets, just replace the numbers with the names of your variables. I took the original code from Ivis’ paper and modified it. In his paper, he uses the data step, and runs each part separately. I prefer SQL and validating each step separately, but find it to be more efficient to run the whole thing at once.
The second section is basically the same as the first, with the exception that it’s calculating the distance in miles. You’ll notice that the only difference is the first number – this is the radius of the Earth (in km or miles, depending on what you’re calculating).
The third section is just a couple of variables from my data that I want included in my output, and the final section are my parameters. If you refer to the legend on the site, you’ll see I’m looking for all campgrounds with flushing toilets and no pets allowed.
When I run the code, I get exactly two results. Because the data is crowdsourced, it is not 100% complete so there may be plenty more campgrounds that meet our parameters, but these are the ones in our data.
Depending on your requirements, you can tweak the SQL (where amenities like '%FT%'
OR amenities like '%NP%'; where amenities NOT like '%FT%' and amenities NOT like '%NP%'; etc.) – play around and see what you can find!
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.