This articles illustrates how you can use the TIMESERIES procedure to check whether your timeseries data contain a record for every time period and how to insert records for these periods. The article illustrates the rationale for checking your timeseries data for missing records and introduces the %CHECK_TIMEID macro that automates the tasks of checking time series data and inserting records.
Note that the TIMESERIES procedure is part of the SAS/ETS package, thus you only can run the code if you have SAS/ETS licensed. You could create a workaround solution using a SAS Datastep, however as soon as you have BY-groups in your data your SAS Datastep code gets complicated.
Consider the following table: You can easily detect a missing value for variable AMOUNT in July 2004. However there is more missingness in the data. If you investigate the DATE variable you see the records for October and November 2005 are missing. The December record immediately follows the September records. Also in 2006 a couple of months are missing in the time series.
How can that happen? It might be that the data have been deleted accidentally or where not properly loaded from the source systems. If the data in this table have been aggregated from transactional data, it might be the case that no transactions took place in the particular month. Thus the aggregation did not return a value.
Lets have a look at such an example in more detail.
In order to explain the necessity for checking and correcting the completeness of time series data, it is important to differentiate between transactional data and time series data.
The following 2 tables show examples of transactional data and time series data, which have been aggregated on an hourly basis.
Here is the aggregated version of this data:
Note that in the aggregated table, there is no record for time 5:00:00 because no web server traffic occurred during that time. For time series analysis, however, the records need to be equally spaced at the time axis. The existence of all consecutive periods is called contiguity. Time series that are not contiguous can cause problems in analyses because the time series methods will not work properly.
From the SASHELP.AIR data, approximately 10% of the records have been deleted with the following statements:
data air_missing; set sashelp.air; if uniform(12345) < 0.1 then delete; run;
From the following table, you can see that now some records are missing (for example, there is no record for month JULY 1949).
Next, a time series analysis with PROC ESM is executed:
proc esm data = air_missing; id date interval = month; forecast air; run;
PROC ESM checks the data and recognizes that the specified ID variable DATE is not contiguous with the specified INTERVAL MONTH. Thus, the following warning is displayed in the SAS log:
WARNING: 1 omitted observations have been detected before observation number 7 according to the INTERVAL=MONTH option and the ID values. The current ID is DATE=AUG1949, the previous ID was DATE=JUN1949.
Similar warnings are displayed when the data are analyzed with PROC FORECAST or PROC ARIMA.
From this example, you can see that it is important to check time series data for contiguity before running the analysis. If a non continuous time point is detected, the missing records need to be inserted into the data. Note that the point here is not that missing values will be imputed with another value, but that missing records will be inserted into the data, most likely with a value of 0.
Inserting a record with a zero value is plausible when, for example, data such as those shown in the two tables above are considered. Because there was no web server traffic at hour 5, no records are found in the transactional data for this time interval and, thus, the respective record is not created in the time series data. If a record is inserted, the value that makes most sense for this interval is 0 because it represents no traffic.
There are various ways to check for missing records and to insert those records. Implementations include DATA step programs that check for the time interval between two records as well as the creation of a master table that holds a contiguous set of records, which is then merged with the original data to find records for the missing time intervals.
These implementations, however, can become very complicated for time series data with cross-sectional dimensions. In this case, it is necessary to check the contiguity for all possible cross-sectional dimensions that also may have different start and end points.
A very smooth implementation of this task can be achieved with PROC TIMESERIES, which provides the following functionalities:
A typical PROC TIMESERIES example is shown here:
proc timeseries data = air_missing out = TIMEID_INSERTED; id date interval = MONTH setmiss=0; var air; run;
A macro has been coded to help you use PROC TIMESERIES to check for and insert missing records in timeseries data.
This macro can run in two modes: CHECK mode and INSERT mode.
Note that INSERT-mode is very similar to the code PROC TIMESERIES shown above. The rationale of the macro is to provide you a single tool to check for missing records first, and discuss the resulting table with IT and business people. And then insert the the missing values in the second step with almost the same syntax.
Running this macro in CHECK mode for the AIR_MISSING data with the following syntax generates the records in the TIMEID_MISSING data set as shown below.
%check_timeid(data = air_missing, mode = CHECK, timeid = date, value = air);
If there were no missing records from a contiguity perspective, the data set would have been empty. Note that the displayed values for the AIR variable for these records are only an arbitrary value that is used to check for contiguity in the data. They are not inserted into the original table.
Running the macro in INSERT mode with the following syntax creates an output data set called TIMEID_INSERTED:
%check_timeid(data = air_missing, mode = INSERT, timeid = date, value = air);
For month JUL1949, a record has been inserted and the value of AIR has been set to 0. This data set now fulfills the requirements of contiguity.
Note that when using the macro, you can specify a list of variables that receive 0 values for the inserted records. And the macro can also be run with one or more BY variables to insert records if the data are structured with cross-sectional dimensions.
For example, if there are missing records in the SASHELP.PRDSALE data (or in a copy of the data set called PRDSALE_MISSING), these records can be inserted for each BY group with the following statements:
%check_timeid(data = prdsale_missing, out = prdsale_insert, mode = INSERT, timeid = month, value = actual,
by = country region division prodtype product);
You can see from the syntax that a list of variables defining the cross-sectional dimensions has been specified with the BY variable.
Note the following from the syntax:
The following variables can be specified with the macro:
Name of the input data set that holds the time series data (mandatory).
Name of the output data set that holds the original data plus the inserted records. Default = TIMEID_INSERTED.
Name of the output data set that holds missing records. Default = TIMEID_MISSING.
Defines whether the macro will run in CHECK or INSERT mode. Valid values are CHECK or INSERT. Default = INSERT.
Name of the time ID variable (mandatory).
Time interval of the time series data. This interval is used to check for contiguity.
Name(s) of the value variable(s) that receives the inserted value (mandatory). Note that a list of variables or the global variable _NUMERIC_ can be specified here to denote all numeric variables in the data set.
Value that will be inserted into the value variable for missing records. Default = 0.
Optional variable that contains the names of the variables that define the cross-sectional dimensions.
Arbitrary value that is inserted into the data in CHECK mode to identify the missing records. Default = 123456789.123456789.
This example has been taken from my SAS Press book "Data Quality for Analytics Using SAS".
Further books of the author in SAS Press:
Related SAS-Communities articles
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.