Picking up where we left off with the last Free Data Friday post (and keying off Chris Hemedinger‘s comment), I wanted to go a little deeper into the geospatial analysis of the campground data we last analyzed.
Get the data
Download the data from here, and I recommend you copy / paste the list of short forms from the site into your code for easy reference.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials.
Additional resources are available in this article.
Getting the data ready
As I mentioned last week, other than merging the data into a single Excel file, there wasn’t much I had to do. Before I imported it, I used the Excel “delete duplicates” tool (which removed the 1 record I mentioned above) and then imported it into the University Edition environment.
First, let me address what Chris mentioned in his comment, that the GEODIST function allows you to use your data or provide the latitudes / longitudes and calculate the distance (in kilometers or miles). This is a much easier way to do it!
Here’s the code, more or less the same with the exception of the Distance calculations. This new code is much easier to read, and more efficient from a resource perspective.
proc sql; select geodist(lat, lon, 43.670, -79.386, 'K') as dist_K, geodist(lat, lon, 43.670, -79.386, 'M') as dist_M, name, prov_state, town from work.IMPORT where amenities like '%FT%' and amenities like '%NP%'; quit;
Here’s the results. You can see that they're the same as last week’s, so from a validity and reliability aspect, we’re good!
The other analytical technique I wanted to show is taking a table and joining it onto itself. This is useful when you want to compare the values to themselves, and in the example I have, I’m calculating the distance between each campsite to every other campsite. I have put limits on the data to keep the running time to a minimum, and enough data is returned to show what I’m referring to.
Here's the full code (we'll go through it, I just want you to see the whole thing):
proc sql; create table work.test as select a.name as name1, a.prov_state as prov_state1, b.name as name2, b.prov_state as prov_state2, geodist(a.lat, a.lon, b.lat, b.lon, 'K') as dist_K from work.import a , work.import b where a.name<>b.name and geodist(a.lat, a.lon, b.lat, b.lon, 'K')>0 and geodist(a.lat, a.lon, b.lat, b.lon,'K')<100; select * from work.test where dist_K<30; quit;
In breaking the code down, looking at the FROM statement first, you’ll see I have the work.import table referenced twice. SQL as a language has a useful feature called aliasing, which enables you to refer to a table as something else; in this case, I’m using a and b. This allows SAS to distinguish between the tables, columns, and values.
from work.import a , work.import b
Next, the select statement has the list of the variables I want.
select a.name as name1, a.prov_state as prov_state1, b.name as name2, b.prov_state as prov_state2, geodist(a.lat, a.lon, b.lat, b.lon, 'K') as dist_K
You’ll note that this time, I have a. or b. in front of the variable names – this is because the same columns appear in both tables (because it’s the same table twice) and I have to distinguish which is which. You’ll see that I’ve done the same with the GEODIST function as well; a notable difference is that in previous examples, I’ve used the 43.670, -79.386 coordinates as the second point. In this example, because I’m comparing each camp to every other camp, I use the LAT and LON from each instance of the table.
Finally, I’m creating a new table (work.test) here:
create table work.test as
and as I mentioned I’m limiting the data in the Where clause:
where a.name<>b.name and geodist(a.lat, a.lon, b.lat, b.lon, 'K')>0 and geodist(a.lat, a.lon, b.lat, b.lon,'K')<100;
First, I’m saying where the names do not match (I don’t want Jones’ Camp Ground compared to itself). I’m also saying where the distance is greater than 0 (if A -> B is 15, I don’t need to know that B -> A is -15). Greater than 0 also removes the possibility of campsites that have been entered twice but with slightly different names (Jones’ Camp Ground and Jones Camp Site, for example). The first criteria would have skipped this, and the >0 criteria would remove them. The final criteria is to limit the distance to <100 kilometers.
Because I have put this data into a temporary table (work.test), I need to now display what I’ve done, and that’s the single line query at the bottom.
select * from work.test where dist_K<30;
I added the distance <30 kilometers criteria just to speed up the query, but to give a variety of results.
Because the data is sorted alphabetically by PROV_STATE then by Camp, the sites in Alberta come up first. You’ll see there seems to be quite a few camps within close proximity to a camp called Anthony Henday, which may be of interest to someone that is looking to do some travelling around the area.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.