We’re smarter together. Learn from this collection of community knowledge and add your expertise.

The sequel to PROC SQL (or, More Advanced Tips and Tricks)

by Super Contributor on ‎06-30-2017 10:05 AM (1,010 Views)

Today's Free Data Friday post revisits PROC SQL and we'll delve into museum data.


Museums have always fascinated me. I'm intrigued by places dedicated to preserving artifacts about people I'll never meet and places I'll probably not get to visit. Canada has a large number of museums, ranging from agriculture to space and everything in between. I've visited many, but definitely want to see more, especially the Canada Science and Technology Museum! 


Get the Data

I found a data set that includes all artifacts from the Canada Agriculture and Food Museum, Canada Aviation and Space Museum, and the Canada Science and Technology Museum. You can get the data from hereFreeDataFriday_graphic.jpg


How to go about getting SAS University Edition

If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.


Getting the data ready

Because the data uses pipes (|) as the way to separate the columns, the regular Import task won't work, so I used PROC IMPORT:



            DATAFILE= '/folders/myfolders/musuems.csv'
            DBMS=DLM REPLACE;


The Results

I've found myself needing to do some "acrobatics" with some data over the past week and PROC SQL really saved me. I used a lot of these to save time, and I realised that because some of them are undocumented or lesser known, I should post about them. 


The first is an easy way to catch typical typos; it's not as good as PERL, but for basic checks it's easy to use.


SELECT artifactNumber, GeneralDescription
FROM work.want
WHERE GeneralDescription =* 'fiber';


You'll see that instead of the regular = 'fiber', I have put an * after the equals sign. This allows SAS to search for anything that's similar to your criteria.




You'll notice that not only does it recognise "fibre," but also all other capitalizations of the word.  


We're all familiar with the basic math functions, but sometimes it's more efficient and easier to just use a function.



SELECT artifactNumber, begindate, enddate,
enddate) AS
FROM work.want
WHERE group1='Communications';


Instead of doing EndDate - BeginDate, I used RANGE(), which I use if I have a lot of calculations to perform on a number of variables. (I find it easier to type).





The last two tips I wanted to share get more into the debugging / troubleshooting side of PROC SQL. While very handy, they can get you started tracking down pesky issues. I highly recommend reading this paper for more information.  



proc sql _method outobs=10;
select artifactNumber, begindate, enddate
from work.want;

There are no surprises to the results; what's interesting is what is returned in the log. 





SAS is telling me that it's used a SELECT statement or clause on the WORK.WANT table, which we knew, but I wanted a simple example.


If you use the _METHOD on more complex queries, you'll see a wide range of other descriptors; here's the full list (copied from the paper above):


Code Description
SQXCRTA Create table as Select.
SQXSLCT Select statement or clause.
SQXJSL Step loop join (Cartesian).
SQXJM Merge join operation.
SQXJNDX Index join operation.
SQXJHSH Hash join operation.
SQXSORT Sort operation.
SQXSRC Source rows from table.
SQXFIL Rows filtration.
SQXSUMG Summary stats (aggregates) with GROUP BY clause.
SQXSUMN Summary stats with no GROUP BY clause.


The final trick I wanted to show you is more just to spark your interest. I want to do a more involved post on this, as it can get fairly complex. Still, it's a very useful tool to have in your toolbox.  


One of the features of programming languages is how they actually do what they do. I have spent many hours looking at Execution Plans in SQL Server, network packet analysis, etc.  In PROC SQL, we have _TREE:


SELECT artifactNumber, begindate, enddate
FROM work.want
WHERE objectname = 'Clock';

This is what is written to the Log:




Basically, you start at the left (SELECT) and it goes to the Source, then to the Object (table) which has the variables. You can see my criteria in the bottom. Having a map of how SAS thinks allows me as a programmer to make my code more efficent, troubleshoot more effectively, and have really cool things to hang on my wall :-)


Now it’s your turn!


Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.


Need data for learning?


The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:



We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:



Click Analytics U, then select "Subscribe" from the Options menu.


Happy Learning!



Your turn
Sign In!

Want to write an article? Sign in with your profile.

Looking for the Ask the Expert series? Find it in its new home: communities.sas.com/askexpert.