Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Picking up where we left off with the last Free Data Friday post (and keying off Chris Hemedinger‘s comment), I wanted to go a little deeper into the geospatial analysis of the campground data we last analyzed.
Get the data
Download the data from here, and I recommend you copy / paste the list of short forms from the site into your code for easy reference.
Getting the data ready
As I mentioned last week, other than merging the data into a single Excel file, there wasn’t much I had to do. Before I imported it, I used the Excel “delete duplicates” tool (which removed the 1 record I mentioned above) and then imported it into the University Edition environment.
First, let me address what Chris mentioned in his comment, that the GEODIST function allows you to use your data or provide the latitudes / longitudes and calculate the distance (in kilometers or miles). This is a much easier way to do it!
Here’s the code, more or less the same with the exception of the Distance calculations. This new code is much easier to read, and more efficient from a resource perspective.
proc sql; select geodist(lat, lon, 43.670, -79.386, 'K') as dist_K, geodist(lat, lon, 43.670, -79.386, 'M') as dist_M, name, prov_state, town from work.IMPORT where amenities like '%FT%' and amenities like '%NP%'; quit;
Here’s the results. You can see that they're the same as last week’s, so from a validity and reliability aspect, we’re good!
The other analytical technique I wanted to show is taking a table and joining it onto itself. This is useful when you want to compare the values to themselves, and in the example I have, I’m calculating the distance between each campsite to every other campsite. I have put limits on the data to keep the running time to a minimum, and enough data is returned to show what I’m referring to.
Here's the full code (we'll go through it, I just want you to see the whole thing):
proc sql; create table work.test as select a.name as name1, a.prov_state as prov_state1, b.name as name2, b.prov_state as prov_state2, geodist(a.lat, a.lon, b.lat, b.lon, 'K') as dist_K from work.import a , work.import b where a.name<>b.name and geodist(a.lat, a.lon, b.lat, b.lon, 'K')>0 and geodist(a.lat, a.lon, b.lat, b.lon,'K')<100; select * from work.test where dist_K<30; quit;
In breaking the code down, looking at the FROM statement first, you’ll see I have the work.import table referenced twice. SQL as a language has a useful feature called aliasing, which enables you to refer to a table as something else; in this case, I’m using a and b. This allows SAS to distinguish between the tables, columns, and values.
from work.import a , work.import b
Next, the select statement has the list of the variables I want.
select a.name as name1, a.prov_state as prov_state1, b.name as name2, b.prov_state as prov_state2, geodist(a.lat, a.lon, b.lat, b.lon, 'K') as dist_K
You’ll note that this time, I have a. or b. in front of the variable names – this is because the same columns appear in both tables (because it’s the same table twice) and I have to distinguish which is which. You’ll see that I’ve done the same with the GEODIST function as well; a notable difference is that in previous examples, I’ve used the 43.670, -79.386 coordinates as the second point. In this example, because I’m comparing each camp to every other camp, I use the LAT and LON from each instance of the table.
Finally, I’m creating a new table (work.test) here:
create table work.test as
and as I mentioned I’m limiting the data in the Where clause:
where a.name<>b.name and geodist(a.lat, a.lon, b.lat, b.lon, 'K')>0 and geodist(a.lat, a.lon, b.lat, b.lon,'K')<100;
First, I’m saying where the names do not match (I don’t want Jones’ Camp Ground compared to itself). I’m also saying where the distance is greater than 0 (if A -> B is 15, I don’t need to know that B -> A is -15). Greater than 0 also removes the possibility of campsites that have been entered twice but with slightly different names (Jones’ Camp Ground and Jones Camp Site, for example). The first criteria would have skipped this, and the >0 criteria would remove them. The final criteria is to limit the distance to <100 kilometers.
Because I have put this data into a temporary table (work.test), I need to now display what I’ve done, and that’s the single line query at the bottom.
select * from work.test where dist_K<30;
I added the distance <30 kilometers criteria just to speed up the query, but to give a variety of results.
Because the data is sorted alphabetically by PROV_STATE then by Camp, the sites in Alberta come up first. You’ll see there seems to be quite a few camps within close proximity to a camp called Anthony Henday, which may be of interest to someone that is looking to do some travelling around the area.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.