We’re smarter together. Learn from this collection of community knowledge and add your expertise.

PROC SQL Continued - Descriptive Stats and a couple of "gotchas"!

by Regular Contributor on ‎01-15-2016 05:13 PM - edited on ‎01-20-2016 04:18 PM by Community Manager (561 Views)

If you’ve been reading the Free Data Friday article series for a while, you’ve seen how we’ve explored various aspects of PROC SQL. This will be the last one on SQL for a while. I’m going to be starting in on other, more complex topics next week (such as Survey Analysis, Time to Event Analysis, and if we’re really lucky, Predictive Modeling and maybe even some Simulation!)

 

This week’s dataset comes from the Government of Canada, and can be downloaded from the Open Data website found hereFreeDataFriday_graphic.jpgThe data, entitled “Ministerial Revenues as per the public accounts of Canada, Fiscal Year 2014/15” is in thousands of dollars and published at the end of each fiscal year (for those who don’t know, a Ministry is similar to a Department in the USA – we have a Ministry of Finance, Community Affairs, etc). 

 

How to go about getting SAS University Edition

If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.

 

Getting the data ready

I have imported the data into my SAS University Edition into my WORK Library, and called the dataset BLOG.  As with other datasets, I recommend doing some preliminary exploration, including reading any supporting documentation provided by the website.

 

The results

One of the “gotchas” about using different versions of SQL in SAS versus other places (and there’s a lot of other places you can use SQL – Access Databases, SQL Server, Oracle, etc.) is that it’s non-standard.  That means that SAS is able to add, modify or change the structure of the syntax – which is a really good thing, as my previous post on Text Analysis in Song Titles showed!  One of the other ways PROC SQL differs is how you sample data.  In standard SQL, you use SELECT TOP 10 * to select all variables for the first ten rows.  This doesn’t work in SAS, but here’s how you can achieve the same thing:

1.png

And here’s the result:

2.png

I’m going to randomly pick the Return on Investments column, and so I want to see the Ministry (column #2) and pull some basic descriptive statistics.  Here’s the query I’ve written:

3.png

Wait a second – I get an error message that reads “The following columns were not found in the contributing tables: investments, on, return”.  But I open up the dataset, and on the right side it clearly says return-on-investments!  What’s going on?  This is something I find I trip on all the time, and so wanted to show why – SAS shows the label of the variable, not the name!  To find out the actual name of the column, you need to go into the properties.  Luckily for me, the name is return_on_investments, so I don’t have to change too much.

 

4.png

Here’s the updated query:

5.png

 

But when I run the code, I get some rather strange results – why are the same numbers repeating?  If you remember in one of the previous posts where I showed you counting by groups, you need a GROUP BY and an ORDER BY statement.

 

6.png

Once I’ve added in the GROUP BY and ORDER BY, here’s the output.  Much better, but now I’m not sure which column is which statistic!  7.png

 

Here’s a screen shot of the updated code; notice how I have put each summary statistic on it’s own line?  This is a personal preference I have, and I highly recommend you play around with different formats (including indents, whitespace, lines between sections, etc.) so you can not easily read your code, but share it with others.

 

8.png

 

Here’s the resulting output, which is much more logical.  But I’m still not 100% satisfied, because I don’t like the rows with all zeroes – in some cases they would be important, but in this example I want to exclude them.

 

9.png

 

Here’s the updated SQL; simply adding a WHERE statement that says show only the data where the Return_on_investments is greater than 0 allows us to get the results we want.

 

10.png

 

Now it’s your turn!

 

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

 

Need data for learning?

 

The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:

Picture8.png

We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:

 Picture9.png

Click Analytics U, then select "Subscribe" from the Options menu.

 

Happy Learning!

 

Your turn
Sign In!

Want to write an article? Sign in with your profile.