I wanted this to be a continuation of the missing data series, digging more into handling missing values as well as showing appropriate techniques to replace your gaps (or not). But that's for another day. As I am (like many of you) suffering from a wicked bout of allergies, my brain could not handle all the intense math. So let's continue with missing data, this time revisiting PROC SQL.
Get the Data
From the first article, I’m using the Open Data Toronto’s dataset for water main breaks.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
Once you’ve imported the data, you should be ready to go; the dataset has everything you need – a date time stamp, the year, and the counts.
I am using the data created by the Time Series Preparation task (as outlined in this article, Predicting the Future Using SAS University Edition - Part 2 - Time Series Preparation) and have left the missing values as missing.
proc sql; create table work.time2 as select year(break_date) as yr, month(break_date) as month, weekday(break_date) as wkday, total from work.tsprep; quit;
This code takes the Break_Date variable and strips out the Year, Month, and Weekday into new variables.
I aim to show you a quick and efficient way to have a more complete picture of your dataset. Now that we’ve created the WORK.TSPREP data set (reminder: this puts in all the days that don’t have any values, and assigns a period to that day to indicate it’s missing), we’re going to use it in our SQL.
proc sql; select yr, count(case when (total=.) then 1 end) as NoBreaks, count(case when (total<>.) then 1 end) as Breaks from work.time2 group by yr order by yr; quit;
This is a fairly straightforward example, and if you remember the previous Free Data Friday article, How to Analyze Meteorite Data from NASA, you’ll recall that we set up CASE statements in our descriptive statistics to apply logic to the data. This is the same thing we’ll do here, but in a much simpler form; all I’m saying is if the TOTAL variable is missing (in SAS terms, contains a period) then call the variable “NoBreaks” and if the TOTAL does not contain a period, call it “Breaks”. Here’s what I get:
So, from the previous articles, we know that 2009 had a very small number of breaks compared to other years, and that 1994 had the highest.
Having worked with time series data for many years, one step I always take is to break the data down into smaller groups; in this case, month and day.
proc sql; select month, count(case when (total=.) then 1 end) as NoBreaks, count(case when (total<>.) then 1 end) as Breaks from work.time2 group by month order by month; select wkday, count(case when (total=.) then 1 end) as NoBreaks, count(case when (total<>.) then 1 end) as Breaks from work.time2 group by wkday order by wkday; quit;
When I run these, here’s what I get:
Again, from the previous articles, no surprise that the number of breaks are highest in January, March, and December. What I wasn’t expecting was the breaks to be higher Monday – Thursday, but then it dawned on me – this would be when construction was going on, and if there’s digging, chances are good that they’ll hit something.
This is a prime example of why looking at your data in multiple ways, and having an understanding of the data, is so critical. In many places, Monday is the first day of the week, which would shift the interpretation. In most areas of the world, the weather doesn’t get cold enough to break the water mains, so seeing such high numbers in the winter months wouldn’t make sense if you were unfamiliar with the weather patterns in Toronto. Being able to pull out these statistics is essential to being a data analyst, but having an understanding of what they are telling you is even more important.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.