Today's Free Data Friday post revisits PROC SQL and we'll delve into museum data.
Museums have always fascinated me. I'm intrigued by places dedicated to preserving artifacts about people I'll never meet and places I'll probably not get to visit. Canada has a large number of museums, ranging from agriculture to space and everything in between. I've visited many, but definitely want to see more, especially the Canada Science and Technology Museum!
Get the Data
I found a data set that includes all artifacts from the Canada Agriculture and Food Museum, Canada Aviation and Space Museum, and the Canada Science and Technology Museum. You can get the data from here.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
Because the data uses pipes (|) as the way to separate the columns, the regular Import task won't work, so I used PROC IMPORT:
PROC IMPORT OUT= WORK.WANT DATAFILE= '/folders/myfolders/musuems.csv' DBMS=DLM REPLACE; DELIMITER='|'; GETNAMES=YES; DATAROW=2; RUN;
I've found myself needing to do some "acrobatics" with some data over the past week and PROC SQL really saved me. I used a lot of these to save time, and I realised that because some of them are undocumented or lesser known, I should post about them.
The first is an easy way to catch typical typos; it's not as good as PERL, but for basic checks it's easy to use.
PROC SQL; SELECT artifactNumber, GeneralDescription FROM work.want WHERE GeneralDescription =* 'fiber'; QUIT;
You'll see that instead of the regular = 'fiber', I have put an * after the equals sign. This allows SAS to search for anything that's similar to your criteria.
You'll notice that not only does it recognise "fibre," but also all other capitalizations of the word.
We're all familiar with the basic math functions, but sometimes it's more efficient and easier to just use a function.
PROC SQL; SELECT artifactNumber, begindate, enddate, RANGE(begindate, enddate) AS Years FROM work.want WHERE group1='Communications'; QUIT;
Instead of doing EndDate - BeginDate, I used RANGE(), which I use if I have a lot of calculations to perform on a number of variables. (I find it easier to type).
The last two tips I wanted to share get more into the debugging / troubleshooting side of PROC SQL. While very handy, they can get you started tracking down pesky issues. I highly recommend reading this paper for more information.
proc sql _method outobs=10; select artifactNumber, begindate, enddate from work.want; quit;
There are no surprises to the results; what's interesting is what is returned in the log.
SAS is telling me that it's used a SELECT statement or clause on the WORK.WANT table, which we knew, but I wanted a simple example.
If you use the _METHOD on more complex queries, you'll see a wide range of other descriptors; here's the full list (copied from the paper above):
/* Code Description SQXCRTA Create table as Select. SQXSLCT Select statement or clause. SQXJSL Step loop join (Cartesian). SQXJM Merge join operation. SQXJNDX Index join operation. SQXJHSH Hash join operation. SQXSORT Sort operation. SQXSRC Source rows from table. SQXFIL Rows filtration. SQXSUMG Summary stats (aggregates) with GROUP BY clause. SQXSUMN Summary stats with no GROUP BY clause. */
The final trick I wanted to show you is more just to spark your interest. I want to do a more involved post on this, as it can get fairly complex. Still, it's a very useful tool to have in your toolbox.
One of the features of programming languages is how they actually do what they do. I have spent many hours looking at Execution Plans in SQL Server, network packet analysis, etc. In PROC SQL, we have _TREE:
PROC SQL _TREE; SELECT artifactNumber, begindate, enddate FROM work.want WHERE objectname = 'Clock'; QUIT;
This is what is written to the Log:
Basically, you start at the left (SELECT) and it goes to the Source, then to the Object (table) which has the variables. You can see my criteria in the bottom. Having a map of how SAS thinks allows me as a programmer to make my code more efficent, troubleshoot more effectively, and have really cool things to hang on my wall :-)
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.