Using the TIMESERIES procedure to check the continuity of your timeseries data

6 Likes

This articles illustrates how you can use the TIMESERIES procedure to check whether your timeseries data contain a record for every time period and how to insert records for these periods. The article illustrates the rationale for checking your timeseries data for missing records and introduces the %CHECK_TIMEID macro that automates the tasks of checking time series data and inserting records.

Note that the TIMESERIES procedure is part of the SAS/ETS package, thus you only can run the code if you have SAS/ETS licensed. You could create a workaround solution using a SAS Datastep, however as soon as you have BY-groups in your data your SAS Datastep code gets complicated.

MISSING RECORDS or MISSING VALUES?

Timeseries data with missing values and missing records

Consider the following table: You can easily detect a missing value for variable AMOUNT in July 2004. However there is more missingness in the data. If you investigate the DATE variable you see the records for October and November 2005 are missing. The December record immediately follows the September records. Also in 2006 a couple of months are missing in the time series.

How can that happen? It might be that the data have been deleted accidentally or where not properly loaded from the source systems. If the data in this table have been aggregated from transactional data, it might be the case that no transactions took place in the particular month. Thus the aggregation did not return a value.

Lets have a look at such an example in more detail.

The difference between transactional data and time series data

In order to explain the necessity for checking and correcting the completeness of time series data, it is important to differentiate between transactional data and time series data.

Transactional data are time-stamped data that are collected with no particular frequency over time. Examples of transactional data include Internet data like website visits or point-of-sale data in retail.
Time series data, on the other hand, are time-stamped data that are collected at a particular frequency over time, like web hits per hour or sales amounts per day.

The following 2 tables show examples of transactional data and time series data, which have been aggregated on an hourly basis.

Transactional Data

Here is the aggregated version of this data:

Timeseries Data

Note that in the aggregated table, there is no record for time 5:00:00 because no web server traffic occurred during that time. For time series analysis, however, the records need to be equally spaced at the time axis. The existence of all consecutive periods is called contiguity. Time series that are not contiguous can cause problems in analyses because the time series methods will not work properly.

Non-continuous time series data cause problems in the analysis

From the SASHELP.AIR data, approximately 10% of the records have been deleted with the following statements:

data air_missing;
 set sashelp.air;
 if uniform(12345) < 0.1 then delete;
run;

From the following table, you can see that now some records are missing (for example, there is no record for month JULY 1949).

Next, a time series analysis with PROC ESM is executed:

proc esm data = air_missing;
 id date interval = month;
 forecast air;
run;

PROC ESM checks the data and recognizes that the specified ID variable DATE is not contiguous with the specified INTERVAL MONTH. Thus, the following warning is displayed in the SAS log:

WARNING: 1 omitted observations have been detected before observation
 number 7 according to the INTERVAL=MONTH option and the ID
 values. The current ID is DATE=AUG1949, the previous ID was
 DATE=JUN1949.

Similar warnings are displayed when the data are analyzed with PROC FORECAST or PROC ARIMA.

Checking and assuring contiguity

From this example, you can see that it is important to check time series data for contiguity before running the analysis. If a non continuous time point is detected, the missing records need to be inserted into the data. Note that the point here is not that missing values will be imputed with another value, but that missing records will be inserted into the data, most likely with a value of 0.

Inserting a record with a zero value is plausible when, for example, data such as those shown in the two tables above are considered. Because there was no web server traffic at hour 5, no records are found in the transactional data for this time interval and, thus, the respective record is not created in the time series data. If a record is inserted, the value that makes most sense for this interval is 0 because it represents no traffic.

There are various ways to check for missing records and to insert those records. Implementations include DATA step programs that check for the time interval between two records as well as the creation of a master table that holds a contiguous set of records, which is then merged with the original data to find records for the missing time intervals.

These implementations, however, can become very complicated for time series data with cross-sectional dimensions. In this case, it is necessary to check the contiguity for all possible cross-sectional dimensions that also may have different start and end points.

Using the TIMESERIES procedure to handle missing records

A very smooth implementation of this task can be achieved with PROC TIMESERIES, which provides the following functionalities:

Aggregates, aligns, and describes transactional or time series data
Converts transactional data into time-stamped data
Performs seasonal decomposition of the data
Checks for and inserts missing records
Handles and converts missing and zero values

A typical PROC TIMESERIES example is shown here:

proc timeseries data = air_missing
 out = TIMEID_INSERTED;
 id date interval = MONTH setmiss=0;
 var air;
run;

The ID statement sets the time variable and the interval spacing.
Here, the SETMISS option defines which values to insert for the VAR variable AIR if a missing record is detected.
Missing records are identified automatically based on the ID variable and the spacing.

The %CHECK_TIMEID macro

A macro has been coded to help you use PROC TIMESERIES to check for and insert missing records in timeseries data.

This macro can run in two modes: CHECK mode and INSERT mode.

CHECK mode checks for the existence of missing records.
INSERT mode inserts the missing records.

Note that INSERT-mode is very similar to the code PROC TIMESERIES shown above. The rationale of the macro is to provide you a single tool to check for missing records first, and discuss the resulting table with IT and business people. And then insert the the missing values in the second step with almost the same syntax.

Running this macro in CHECK mode for the AIR_MISSING data with the following syntax generates the records in the TIMEID_MISSING data set as shown below.

%check_timeid(data = air_missing, 
 mode = CHECK, 
 timeid = date,
 value = air);

If there were no missing records from a contiguity perspective, the data set would have been empty. Note that the displayed values for the AIR variable for these records are only an arbitrary value that is used to check for contiguity in the data. They are not inserted into the original table.

Running the macro in INSERT mode with the following syntax creates an output data set called TIMEID_INSERTED:

%check_timeid(data = air_missing,
 mode = INSERT, 
 timeid = date, 
 value = air);

For month JUL1949, a record has been inserted and the value of AIR has been set to 0. This data set now fulfills the requirements of contiguity.

Code Example to use the %CHECK_TIMEID macro

Note that when using the macro, you can specify a list of variables that receive 0 values for the inserted records. And the macro can also be run with one or more BY variables to insert records if the data are structured with cross-sectional dimensions.

For example, if there are missing records in the SASHELP.PRDSALE data (or in a copy of the data set called PRDSALE_MISSING), these records can be inserted for each BY group with the following statements:

%check_timeid(data = prdsale_missing,
 out = prdsale_insert,
 mode = INSERT,
 timeid = month,
 value = actual,
 by = country region division prodtype product);

You can see from the syntax that a list of variables defining the cross-sectional dimensions has been specified with the BY variable.

Note the following from the syntax:

PROC TIMESERIES is used to check for missing records and to insert the missing records.
Inserting the missing records is done using typical PROC TIMESERIES syntax. An insert value different from 0 can be specified with the INSERTVALUE macro variable.
Checking for missing records is also done with PROC TIMESERIES:
Here, records with an arbitrary value of -123456789.123456789 are inserted if a record is missing. The inserted value is then used to identify the inserted records.
If the default value of -123456789.123456789 conflicts with real data, a different value using the CHECKDUMMYVALUE variable can be specified.
Cross-sectional dimensions are treated with the BY statement in PROC TIMESERIES.
If BY variables are specified with the macro, the input data are sorted for these variables, which may take some time for large data sets.

Parameters of the %CHECK_TIMEID macro

The following variables can be specified with the macro:

DATA

Name of the input data set that holds the time series data (mandatory).

OUT

Name of the output data set that holds the original data plus the inserted records. Default = TIMEID_INSERTED.

OUT_CHECK

Name of the output data set that holds missing records. Default = TIMEID_MISSING.

MODE

Defines whether the macro will run in CHECK or INSERT mode. Valid values are CHECK or INSERT. Default = INSERT.

TIMEID

Name of the time ID variable (mandatory).

INTERVAL

Time interval of the time series data. This interval is used to check for contiguity.

VALUE

Name(s) of the value variable(s) that receives the inserted value (mandatory). Note that a list of variables or the global variable _NUMERIC_ can be specified here to denote all numeric variables in the data set.

INSERTVALUE

Value that will be inserted into the value variable for missing records. Default = 0.

BY

Optional variable that contains the names of the variables that define the cross-sectional dimensions.

CHECKDUMMYVALUE

Arbitrary value that is inserted into the data in CHECK mode to identify the missing records. Default = 123456789.123456789.

Links and References

This example has been taken from my SAS Press book "Data Quality for Analytics Using SAS".

Further books of the author in SAS Press: