Have a look at your TIMESERIES data from a bird's-eye view - Profile their missing value structure
- Article History
- RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Printer Friendly Page
- Report Inappropriate Content
This article introduces the %PROFILE_MV_TS macro and illustrates a different view on your time series data. You will learn how you can profile your data for missing values, zero values as well as different start points and end point of groups of timeseries.
Note that a webinar on this topic has been recorded in my "Data Preparation for Data Science" webinar series. You find a link to this session at the end of this article.
#datapreparation4datascience
Yes! Line-charts are helpful, but:
Usually we use a line-chart to "look" at our timeseries data. Sure, this is definitely a good way to visualize trends, seasons, cycles, outliers in your data and it allows you to decide about the modeling strategy that you would like to apply.
From a data quality point of view, however, you would in addition like to get a different look at our data. You would like to see
- Are there missing value in the time series?
- Are the zero values in the time series?
- Do you have time periods where we have a large cumulation of missing values or zero values?
- Are there leading or trailing missing values or zero values for time series, which you have to treat before we run the timeseries analysis?
- What is the length of our timeseries? Do they have a sufficient data history to run timeseries forecasting models?
- Do your timeseries have different start- and end-dates?
Usually it does not work to browse through your data table to answer that questions. If you have a large number of time series you won't be able to see such pattern.
Counting the number of missing values for each time series and comparing it with the number of observations identifies the proportion of missing values. However, it makes a difference whether these values occur within the series or at the beginning or end of the series. It also makes a difference whether the missing values occur in blocks or are distributed over different periods.
Missing or zero values at the beginning or end of a series may result when no data are available before or after a certain point in time (for example, when a product has not been launched before a certain date or is not sold anymore after a certain date).
Change your view! Look at your data from a bird's-eye view.
Imagine you could look at your timeseries data from a bird's-eye view as shown in the following graph.
This diagram does not focus on timeseries elements like trend or seasonality. However it gives you insight into the data quality status of your time series. Time series are shown in the graph as horizontal rows over the time axis.
- Grey regions represent non-zero values of the timeseries.
- Missing values in the timeseries are represented with red markers.
- Zero values are shown as yellow markers.
- The ordering of the timeseries at the y-axis considers the start-time, the length and the number of missing value or zero values of the time series.
This allows you to see important details about your timeseries data that you want to know before you start analyzing them. It allows the data scientist to decide whether additional data preprocessing is required and how data should be used in the analysis.
Use the %PROFILE_TS_MV macro to obtain a bird's-eye-view on your timeseries
You can use the %PROFILE_TS_MV macro to create a view on timeseries as shown in the above graph.
- Use the program "CreateTimeseriesDemodata.sas" to create artificial timeseries data. Of course you can also use your own time series data.
- Load the "PROFILE_TS_MV macro, which is contained in the attached program "PROFILE_TS_MV_Macro.sas".
- Run the following code to obtain the results.
%Profile_TS_MV(data = timeseries_demodata,
id = tsid,
date = MonYear,
value= Quantity,
mv = (.),
zv = (0), scatterplot=YES, NMAX_TS=1000);
Maybe you want to use ODS GRAPHICS before the macro to specify the size for a (larger) graph.
ods graphics / width= 18cm height=13cm;
The macro outputs the following tables and the above shown graph.
The TS_PROFILE_CHAIN
First you receive a table with a frequency count for the TS_PROFILE_CHAIN. This chain is a concatenation of the following characters for each timepoint:
- 1: non-zero and non-missing value
- 0: zero-value
- X: missing-value
Thus a TS_PROFILE_CHAIN of
111111111_9_0
represents a timeseries with 9 timepoints and for each timepoint a non-zero and non-missing value was found.
Whereas a TS_PROFILE_CHAIN of
00111XX111000_13_2
represents a time series with 13 datapoints, with 2 (embedded) missing-values and 2 leading and 3 trailing zero values.
From the frequency table below you can see the frequency distribution of the TS_PROFILE_CHAIN.
- The data contains 96 timeseries with a length of 24 months, 79 time series with 48 months.
- You also see, that you have 35 timeseries of length 12. You might want to consider a different method for time series forecasting for these timeseries compared to those where you have a length of 24 months or more.
- There is also a group of timeseries with 6 months of leading zeros. In retail sales this might indicate that the product was already available in the system, before it was actually sold in the shops and therefore create zero-values. You might want to remove these periods from your time series.
Note that the title displays the number of individual timeseries and the dataset name.
Frequency of the TS_PROFILE_CHAIN
There is also a compressed version of the TS_PROFILE_CHAIN. Here only changes between values of 1, 0 or X are indicated.
Compressed version of TS_PROFILE_CHAIN
You see that all timeseries with non-zero or non-missing values are represented in the "1" category and those with leading zeros are represented in the "01" category.
Other tables show the distribution of the length of the time series and the distribution of missing values per time series.
Ordering the timeseries for the display in the graph
Note that the macro orders the timeseries in order to be able to display it (as shown in the above picture) in a grouped way based on startdate, timeseries length and number of missing values or zero values.
With the following code you could create the above graph in an "unsorted" way.
proc sgplot data=MV_PROFILE_TS_PLOT dattrmap=attrmap;
title2 Profile Plot for the Missing Value Structure of the Time Series;
scatter x=monyear y=_ID_ / group=MV_Ind markerattrs=(symbol=squarefilled) attrid=MV_Ind;
run;
You see that the ordering from the macro gives much more insight into the structure and blocks compared to the non-sorted display as shown below.
From a data quality perspective with the focus to check the length and the completeness of the time series, the above graphs still give you more insight compared to the pure line chart.
proc sgplot data=timeseries_demodata;
series x=MonYear y=Quantity / group=tsid;
where quantity ne 0 ;
run;
The Macro %PROFILE_TS_MV
Note the prerequisites for using the macro:
- Each time series needs to have an &ID variable. If a time series with the same &ID occurs more than once in the data (for example, per sales region or sales channel), the respective cross-sectional dimensions variable needs to be specified with the &CROSS macro variable. The macro then concatenates these variables to a new ID variable.
- As a best practice, use GIF or JPG as graphical devices, especially if the input data have hundreds of time series. Otherwise, if JAVA or ACTIVEX is used, creating the profile plot is very time consuming. The statement GOPTIONS DEVICE = GIF can be used before the macro call to change the graphics device to GIF.
The following parameters can be specified with the macro:
DATA
Name of the input data set (mandatory)
ID
Name of the time series ID variable (mandatory).
CROSS
List of variables for the cross-sectional dimensions (optional). The values of these variables are concatenated first to a string and then to the ID variable in order to allow the analysis of the same time series ID for different subgroups.
DATE
Name of the time ID variable (mandatory).
VALUE
Name of the variable that holds the values of the time series (mandatory).
MV
List of values to be considered as missing. Values need to be specified in brackets, separated by a comma. Examples are mv=(.) and mv=(.,9). The default value is (.). Note that this list is used with an IN operator in the DATA step.
ZV
List of values to be considered as zero. Values need to be specified in brackets, separated by a comma. Examples are mv=(0) and mv=(0,-1). The default value is (0). Note that this list is used with an IN operator in the DATA step.
PLOT
Profile plots are only produced if PLOT is set to YES. Default = YES.
LINEPLOT
Shall the profile plot be generated as a line plot. YES|NO. Default = NO
SCATTERPLOT
Shall the profile plot be generated as a scatter plot. YES|NO. Default = YES
NMAX_TS
Depending on the number of time series in the data, this option controls whether a profile plot will be produced. The default number is 500. Thus, if your data contain more than 500 time series, a profile plot will not be produced and you will need to set the value higher. In this case, the following message in the log file is produced:
------------------------------------------------------------------------------
Number of time series = 1026, is higher than nmax_ts value(= 100).
No plot has been created. Reset parameter NMAX_TS to a value of at least 1026
-----------------------------------------------------------------------------
W
Defines the thickness of the lines used in the line plot to represent one time series.
Links and References
This example has been taken from my SAS Press book "Data Quality for Analytics Using SAS".
Further books of the author in SAS Press:
- Applying Data Science - Business Case Studies Using SAS
- Data Preparation for Analytics Using SAS: See page 250f for examples on how to aggregate and disaggregate for different levels of granularity at the time axis.
Related SAS-Communities articles
-
The structure of MISSING VALUES in your data - get a clearer picture with the %MV_PROFILING macro
- Using the TIMESERIES procedure to check the continuity of your timeseries data
-
Replace MISSING VALUES in TIMESERIES DATA using PROC EXPAND and PROC TIMESERIES