This article introduces the %PROFILE_MV_TS macro and illustrates a different view on your time series data. You will learn how you can profile your data for missing values, zero values as well as different start points and end point of groups of timeseries.
Note that a webinar on this topic has been recorded in my "Data Preparation for Data Science" webinar series. You find a link to this session at the end of this article.
#datapreparation4datascience
Usually we use a line-chart to "look" at our timeseries data. Sure, this is definitely a good way to visualize trends, seasons, cycles, outliers in your data and it allows you to decide about the modeling strategy that you would like to apply.
From a data quality point of view, however, you would in addition like to get a different look at our data. You would like to see
Usually it does not work to browse through your data table to answer that questions. If you have a large number of time series you won't be able to see such pattern.
Counting the number of missing values for each time series and comparing it with the number of observations identifies the proportion of missing values. However, it makes a difference whether these values occur within the series or at the beginning or end of the series. It also makes a difference whether the missing values occur in blocks or are distributed over different periods.
Missing or zero values at the beginning or end of a series may result when no data are available before or after a certain point in time (for example, when a product has not been launched before a certain date or is not sold anymore after a certain date).
Imagine you could look at your timeseries data from a bird's-eye view as shown in the following graph.
This diagram does not focus on timeseries elements like trend or seasonality. However it gives you insight into the data quality status of your time series. Time series are shown in the graph as horizontal rows over the time axis.
This allows you to see important details about your timeseries data that you want to know before you start analyzing them. It allows the data scientist to decide whether additional data preprocessing is required and how data should be used in the analysis.
You can use the %PROFILE_TS_MV macro to create a view on timeseries as shown in the above graph.
%Profile_TS_MV(data = timeseries_demodata,
id = tsid,
date = MonYear,
value= Quantity,
mv = (.),
zv = (0), scatterplot=YES, NMAX_TS=1000);
Maybe you want to use ODS GRAPHICS before the macro to specify the size for a (larger) graph.
ods graphics / width= 18cm height=13cm;
The macro outputs the following tables and the above shown graph.
First you receive a table with a frequency count for the TS_PROFILE_CHAIN. This chain is a concatenation of the following characters for each timepoint:
Thus a TS_PROFILE_CHAIN of
111111111_9_0
represents a timeseries with 9 timepoints and for each timepoint a non-zero and non-missing value was found.
Whereas a TS_PROFILE_CHAIN of
00111XX111000_13_2
represents a time series with 13 datapoints, with 2 (embedded) missing-values and 2 leading and 3 trailing zero values.
From the frequency table below you can see the frequency distribution of the TS_PROFILE_CHAIN.
Note that the title displays the number of individual timeseries and the dataset name.
Frequency of the TS_PROFILE_CHAIN
There is also a compressed version of the TS_PROFILE_CHAIN. Here only changes between values of 1, 0 or X are indicated.
Compressed version of TS_PROFILE_CHAIN
You see that all timeseries with non-zero or non-missing values are represented in the "1" category and those with leading zeros are represented in the "01" category.
Other tables show the distribution of the length of the time series and the distribution of missing values per time series.
Ordering the timeseries for the display in the graph
Note that the macro orders the timeseries in order to be able to display it (as shown in the above picture) in a grouped way based on startdate, timeseries length and number of missing values or zero values.
With the following code you could create the above graph in an "unsorted" way.
proc sgplot data=MV_PROFILE_TS_PLOT dattrmap=attrmap;
title2 Profile Plot for the Missing Value Structure of the Time Series;
scatter x=monyear y=_ID_ / group=MV_Ind markerattrs=(symbol=squarefilled) attrid=MV_Ind;
run;
You see that the ordering from the macro gives much more insight into the structure and blocks compared to the non-sorted display as shown below.
From a data quality perspective with the focus to check the length and the completeness of the time series, the above graphs still give you more insight compared to the pure line chart.
proc sgplot data=timeseries_demodata;
series x=MonYear y=Quantity / group=tsid;
where quantity ne 0 ;
run;
Note the prerequisites for using the macro:
DATA
Name of the input data set (mandatory)
ID
Name of the time series ID variable (mandatory).
CROSS
List of variables for the cross-sectional dimensions (optional). The values of these variables are concatenated first to a string and then to the ID variable in order to allow the analysis of the same time series ID for different subgroups.
DATE
Name of the time ID variable (mandatory).
VALUE
Name of the variable that holds the values of the time series (mandatory).
MV
List of values to be considered as missing. Values need to be specified in brackets, separated by a comma. Examples are mv=(.) and mv=(.,9). The default value is (.). Note that this list is used with an IN operator in the DATA step.
ZV
List of values to be considered as zero. Values need to be specified in brackets, separated by a comma. Examples are mv=(0) and mv=(0,-1). The default value is (0). Note that this list is used with an IN operator in the DATA step.
PLOT
Profile plots are only produced if PLOT is set to YES. Default = YES.
LINEPLOT
Shall the profile plot be generated as a line plot. YES|NO. Default = NO
SCATTERPLOT
Shall the profile plot be generated as a scatter plot. YES|NO. Default = YES
NMAX_TS
Depending on the number of time series in the data, this option controls whether a profile plot will be produced. The default number is 500. Thus, if your data contain more than 500 time series, a profile plot will not be produced and you will need to set the value higher. In this case, the following message in the log file is produced:
------------------------------------------------------------------------------
Number of time series = 1026, is higher than nmax_ts value(= 100).
No plot has been created. Reset parameter NMAX_TS to a value of at least 1026
-----------------------------------------------------------------------------
W
Defines the thickness of the lines used in the line plot to represent one time series.
This example has been taken from my SAS Press book "Data Quality for Analytics Using SAS".
Further books of the author in SAS Press:
Related SAS-Communities articles
The structure of MISSING VALUES in your data - get a clearer picture with the %MV_PROFILING macro
Replace MISSING VALUES in TIMESERIES DATA using PROC EXPAND and PROC TIMESERIES
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.