Tips: Identifying and Locating Missing Values and Gaps in Time Series Data

2 Likes

Long time series are often filled with missing values and gaps in time, but determining if your series has missing values and perhaps locating these values, isn’t always as easy as printing the data.

We will address the issue when a time series contains gaps, such that actual missing values are not observed. For example, you know you have a daily series, but two of the observations are separated by 11 days, such as 01Jan2000 and 12Jan2000. If you quickly print the data, you won’t observe a typical missing value, “.”, because there are none. This isn’t always a practical option with large data sets and multiple by groups.

First we will run PROC TIMEID to determine if there are gaps in the data.

proc timeid data=inputdata outintervaldetails=outint;

id date interval=day;

run;

You could also just use

id date;

Omitting the interval is useful if you are unsure of the interval your data follow. This will lead to a best guess based on the data and will detect the best interval, which will be printed in the outintervaldetails output data set on the INTERVAL variable.

Most of the time we know the interval at which our data are recorded and thus the second line of code should be used rather than omitting the interval=option.

proc print data=outint;

var timeid start end nspans minspan maxspan msg interval;

run;

These are the variables in the outintervaldetails data set that will give information about the series. If NSPANS=1 this means your data are uniformly spaced and no missing time ids are present. SPANS refers to the number of intervals between time ids. If NSPANS > 1, then you have gaps in your data. Notice our data has 17 spans with a minimum of 1 and a maximum of 71. MINSPAN will tell you the minimal number of spans between your dates. MAXSPAN will tell you the max number of spans between your dates There are many variables in this data set. Information about these can be found in the documentation. If you wish, you can omit the VAR statement and view all of the variables, but for this exercise these are sufficient for the information we need.

Additionally, suppose you wish to know WHEN missing dates occur in the time series. This might be easy in a series of hundreds of observations, but becomes difficult with tens of thousands, especially amidst BY groups.

The following code is useful in determining such information.

We will try two approaches. I will only cover PROC TIMEID today but could give this example using PROC TIMESERIES as well.

Using PROC TIMEID, spans component, print where spans >1.
Using PROC TIMESERIES

Using Proc TIMEID :

proc timeid data=here.neah outintervaldetails=all print=values;

id date interval=day ;

ods output Decomposition=decomp;

run;

proc print data=decomp;

where span>1;

run;

Here is a subset of the output

Tips: Identifying and Locating Missing Values and Gaps in Time Series Data

Free course: Data Literacy Essentials

Get Started