I was going to do a Part II post to an earlier Survival Analysis article, "How to Survive Survival Analytics in University Edition," but ran into technical difficulties with my computer. So that article is on hold. Luckily, I started working on a project at work using the Forecasting Tasks, so decided to take this opportunity to dig into it further. Please note I’m learning this at the same time, so if you have any experience I’d appreciate your insight!
Get the Data
I’m using the City of Toronto’s Open Data set on water main breaks, available here; you can read about the data here. As I lived in the city for over 20 years, and have worked in it since I was a teenager, water main breaks are something I’ve seen dozens of times. I wanted to see what interesting things I could find by digging into it, so let’s get started!
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
After I imported the data, I ran a simple SQL script to aggregate the data; each break is a separate row, but I want it as a count by day, so I used this:
proc sql; create table work.import2 as select break_date, count(*) as Total from work.import group by break_date order by break_date; quit;
Our data is now ready to be explored!
Within the Forecasting group of tasks, there are three different functionalities you can use – Time Series Data Preparation, Time Series Exploration, and Modeling and Forecasting. We’ll go through each in the next couple of articles, but I wanted to start off with Time Series Exploration.
Our first step is to set up the task, which is relatively straight forward; our data will be WORK.IMPORT2 (as created above), the Dependent variable is the TOTAL, we want to sum as we group the data and finally we’ll set the TIME ID to break_date.
Click on the Analyses tab, and check off the three different Series Plots; we won’t go through everything (as there’s an awful lot!) but we’ll touch on some of the more interesting things in the data.
After you’ve run the code, you’ll get a graph that looks like this:
There was clearly a spike in 1994, and a huge drop in 2012. This is a great start as it tells us there is something very interesting going on in the data; let’s keep digging!
If you refer back to the image with the DATA Tab, you’ll notice down at the bottom we’ve set the Interval to be YEAR. Change that to Quarter (nothing else needs to be modified) and re-run the query.
So what we can clearly see is that in the 4th Quarter of 1994, there was a huge jump. But you can see there does seem to be consistently higher spikes; let’s see if there’s an additional pattern. Scrolling down in your Results window, you should see a second graph like this:
Wow, what a very clear difference between Quarter 1 and 2! January – March is a very clear period of time where water main breaks are more likely; this makes sense as the winter deep freeze has set in, and near March, freeze / thaw patterns may happen that also result in breaks.
Let’s take it down one more level – change the Interval from the second image to Month, and once again re-run the code. Scrolling down, you’ll see this graph:
Over 600 water main breaks in a single month? That’s an average of 20 a day! I feel bad for the poor city workers who would have had to work in the freezing cold.
Once again scrolling down in the Results window, we see even more clearly the January spike, with the drop down into May – August.
The data set does not have Neighbourhood IDs or anything else that can easily be translated to other data sets I’ve used from the City of Toronto. Although the Geographic Coordinates are provided, they would need to be converted before any further geo-spatial analysis can be done. The next article will be digging into additional Data Exploration features and then get into preparing the data for actual forecasting.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.