We are jumping back to two weeks ago where we first started exploring Time Series Analysis using water main breakage data for the city of Toronto. Today's post picks up where we left off. This time we’re preparing the data for the next post, where we’ll actually start doing some modeling and forecasting.
Get the Data
From the first article, I’m using the Open Data Toronto’s dataset for water main breaks,
I’m making one minor change to my original code by adding in the Break_Year variable:
proc sql; create table work.import2 as select break_year, break_date, count(*) as Total from work.import group by break_year, break_date order by break_year, break_date; quit;
We’ll be using this for the duration of this series.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
Once you’ve imported the data, you should be ready to go; the dataset has everything you need – a date time stamp, the year, and the counts.
The first thing we’ll take a look at in this task is what happens when we run it without any options – just the basic information, Data and which variable is our Time Series Variable.
The task has created a new table (WORK.TSPREP) and has assigned a TIME ID to each of the dates.
That’s great and all but not really interesting, but we need to figure out what the task does.
Here, I’ve gone through and filled out a little more information – Time Series variables are Break_Year and Total, my Time ID is the Break_Date, and I’ve set the interval to Day.
Then I go into the Transformations tab and set up the Total to be summed; because I have one row per day, I could have used average.
I run the task, and this is what I get:
This is rather disappointing – it’s identical to the original dataset. What’s the point of running this task then? Then I scroll down and it all makes sense.
SAS has gone through and where I have missing dates, has assigned a date. That way -- regardless how much data I have, whatever interval I’ve set (Day, Month, etc) -- the data is complete. The only possible challenge is that I have missing data (indicated by the periods). In many cases, I would want numeric values to be assigned to the missing values. If I go back to the Data tab in my task, I see there’s a drop down with a variety of choices:
Now, I must admit that in all the time series analyses I’ve done, I’ve always kept the value as missing. Looking at this list however makes me curious when each option would be a better choice. So I’m going to do some research over the next few days and have that as part of the next article.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.