# Predicting the Future Using SAS University Edition - Part 3 - Handling your missing data

by on ‎04-01-2016 05:49 PM (628 Views)

• ,

• ,
• ### Tips and Tricks

In the previous two posts on this topic (found here and here) I gave a very brief overview on common tasks done in the majority of Time Series analysis.  In this post, I’m going to introduce how SAS University Edition enables you to compensate for potential gaps in your data.  In the previous articles, I’ve used the Toronto Watermain breakage data and we will continue using it here.  I will pick up where the last article left off (with a minor modification, discussed below), so if you’ve not read through them I recommend doing so quickly.

Here’s the SQL Code from article 1:

I’m going to add a WHERE clause (which goes between the FROM and the GROUP BY) that reads “where break_date>=’01FEB1990’ and break_date<’01MAR1990’ “ – the reason I’m doing this is for three reasons: 1) Limiting the data to a one month period allows any effects to be amplified as each point carries more weight; 2) there is a huge range in the number of breakages; and most importantly, 3) there is only one day in the month that has no breaks.  A single day is ideal for this exercise because it will allow us to see how changing the value for that one specific day can have a significant impact on the statistics for the overall period.

If you’re running the code while you’re reading this, you should have a dataset called WORK.IMPORT2 with just February’s data.  The next step is to explore how changing only the missing value impacts our analyses.  To do this, I ran the Time Series Preparation code with the first Missing Value option, followed by the Time Series Exploration task.  I reviewed what changed, went back and repeated the process for the second option and so on.  Changing only the Missing Value option allows us to clearly see the impacts, and we’ll go through those shortly.

Although I’ve covered both these tasks previously, I wanted to briefly touch on them again.  Here’s what the Time Series Preparation task looks like, as I’ve filled it out for this exercise:

As this is an article on Missing Data, it would make sense if I showed you the options available for handling your missing data; we’ll work through them sequentially (with the exception of the last one which is self-explanatory):

The second task that is key to this post is Time Series Exploration, and here it is all set to go.  Note that the Accumulation is set to SUM; because my periods are days this won’t have an effect, but if the data is in Days and my Time Period is quarters, having the Accumulation set to Average or Sum will have a significant impact on your analysis.  Note that I’ve also added in the Descriptive Statistics on the Analyses tab (not shown):

I am going to pause here and add that although I have had a number of conversations with both SAS employees and SAS users about this article and how to handle missing data, it is still a territory I am extremely unfamiliar with and so I fully anticipate there being further posts with updates.  Having said that, if you see anything that’s not explained well, or could add to the discussion, I welcome comments or messages.

Missing Value

The first option is the easiest; leaving the missing value as a missing value.  In the original dataset, you may notice that February 12 is immediately followed by February 14.  After running the preparation task, this is what the data looks like:

For those of you who are new to SAS, a period indicates the value is a missing numeric value.  When we run the Time Series Exploration Task, here’s the descriptive statistics:

Average Value of the Accumulated Time Series

When I run the tasks with the second option, you’ll see that I now have a value; it’s the average of all the values.

The descriptive statistics now look like this:

Comparing to the missing value method above, the mean did not change (which makes sense) but the standard deviation has gone down (which also makes sense, as we’re using the average of the dataset).

Minimum Value of the Accumulated Time Series

Here’s the output using the next option:

The descriptive statistics have clearly changed, and because we’re using the minimum value the standard deviation has clearly gone up.  Because we are only changing one value, the descriptive statistics don’t have a huge swing.

Median Value of Accumulated Time Series

Output using the median value:

The Median is very close to the Mean, so the descriptive statistics are gong to be similar.

Maximum Value of Accumulated Time Series

Output when we replace the missing value with the maximum value:

This will obviously have a significant impact on the descriptive statistics and the graph.

First Non-Missing Value of Accumulated Time Series

If you have a series of 10 dates with the first, third and fifth values missing, this option will use the second option for all three.  In the case of our data, we’re missing the 13th row of data, and so SAS University Edition takes the first value (which is the first non-missing value) and uses it.

The descriptive statistics and the graph are interesting in that the mean and standard deviation seems to be consistent with the previous examples; again, this makes sense as the value (10) is fairly close to the other values.

Last Non-Missing Value of Accumulated Time Series

Similar to the first non-missing value above, we use the last non-missing value; from the example above, the first, third and fifth spots would be filled with the value from the 10th date.

Here’s the table:

And the descriptive stats (which in this particular case are lower because the new value is lower):

Previous Period Accumulated Non-Missing Value

As I mentioned in the beginning, specifying the Accumulation as Average or Sum will play a significant role in your final output.  This is a good example of how that comes into effect.  Because I’m using Days, the previous period is just the previous day, and so the average and sum are the same.  Hypothetically speaking, if our period was a two-day cycle, the sum would be 6+7 = 13, whereas the average would be (6+7)/2 or 6.5.  Using 13 would have a profound effect on the outcome, as that is higher than the current maximum.

The descriptive statistics are basically close to the previous examples:

Next Period Accumulated Non-Missing Value

Similar to the above example, depending on your data, your settings, and your previously set options, this may have wildly different results.  In this example, we’re fortunate because the next period’s value (February 14th) is still within the other values, so nothing is going to go to wildly askew.

Conclusions

I admit, I have spent more time playing with the two tasks than any other blog I’ve written previously.  Although I’ve been doing forecasts for over 10 years, I’ve always had complete data; handling missing data is very new to me, and so I look forward to your comments, thoughts and suggestions for reading.  Future posts will talk about more complex concepts such as problems you could run into; until next week, Happy Coding!

Now it’s your turn!

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

Need data for learning?

The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:

We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:

Click Analytics U, then select "Subscribe" from the Options menu.

Happy Learning!

Contributors