Proc TSMODEL Notes: Creating Input Variable Lead Values, Part 1

In most cases, generating forecasts with time series or machine learning models that contain input variables requires lead, or future, values for those inputs. In this sequence of posts, we’ll focus on functionality in the TSMODEL procedure that enables the creation and use of future values for input variables.

Future covariate values can be user-provided. Alternatively, the forecasting system can be set up to generate lead values for the model inputs as a step in the forecasting process. This post focuses on the data arrangement and functionality needed to implement user-provided lead values of inputs in a TSMODEL-based forecasting project.

We’ll start with a simple data set, BLOGDATA, shown below.

There are three BY groups, or sequences, identified by the variable ACCOUNT. Y will be the dependent variable. Account 1 has the most observations and the earliest starting date, 01FEB2025. Account 2 has four observations and the most recent end date, 01JUL2025. Account 3 has only two observations.
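Because the table itself appears only as an image, the DATA step below is one hypothetical way to build a BLOGDATA table that matches the description. The Y values, the end date for account 1, and the dates for account 3 are placeholders, not the values from the original table. The step also assumes an active CAS session with CASUSER assigned as a libref.

```sas
/* Assumption: a CAS session is active and CASUSER is assigned, e.g.
   cas mysess;
   libname casuser cas caslib=casuser;                              */

/* Hypothetical stand-in for BLOGDATA: Y values are made up, and the
   dates for accounts 1 and 3 are chosen only to fit the description */
data casuser.blogdata;
    input account date :date9. y;
    format date date9.;
    datalines;
1 01FEB2025 10
1 01MAR2025 12
1 01APR2025 11
1 01MAY2025 14
1 01JUN2025 15
2 01APR2025 20
2 01MAY2025 22
2 01JUN2025 21
2 01JUL2025 25
3 01APR2025 5
3 01MAY2025 7
;
run;
```

With these placeholder rows, the earliest date is 01FEB2025 and the latest is 01JUL2025, so the full date range covers six monthly intervals.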

After the data is loaded into memory, it is read into the TSMODEL procedure. The syntax shown below creates a simple variable, TIME, that counts intervals. The ID, VAR and BY statements combine to define the time series, or arrays, that flow into the SUBMIT block for processing. The OUTARRAYS statement declares that a new array, TIME, will be created in the SUBMIT block. Existing and new arrays are written to the OUTARRAY= table, BLOGLEADS.

To create a new array for each of the three Y time series (ACCOUNT 1 – 3), we’ll use a loop in the SUBMIT block. The intention is for the loop to generate values for TIME that run from the first observation of each Y series to six intervals past the last observation.

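proc tsmodel data=casuser.blogdata outarray=casuser.blogleads;
    id date interval=month;
    by account;
    var y;
    /* declare TIME as a new array to be filled in the SUBMIT block */
    outarrays time;
    /* intent: count intervals through six periods past the end of Y */
    submit;
        do i = 1 to dim(y) + 6;
            time[i] = i;
        end;
    endsubmit;
run;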

The resulting table, BLOGLEADS, doesn’t give us what we intended to create, as shown below. We’ll describe what happened first. Then we’ll outline the modifications necessary to produce the BLOGLEADS table with the desired arrangement.

Statements and options above the SUBMIT block define the arrays that flow into it. Under default settings, the length of each array is determined by the earliest and latest dates associated with the arrays listed on the VAR statement. No lead values for TIME were produced, because the loop’s request to write past the end of the array was ignored. However, DATE values were added, and the Y, DATE and ACCOUNT arrays were extended, so that the start and end dates are now the same for each Y array. Notice that the TIME variable has six values for each array.
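One way to confirm this behavior is to summarize the output table by BY group. The query below is a minimal sketch that assumes the CASUSER libref is available to PROC SQL; the column aliases N_ROWS, N_TIME, FIRST_DATE and LAST_DATE are illustrative names. If the description above holds, each account should report six rows, six non-missing TIME values, and the same first and last dates.

```sas
/* Summarize the OUTARRAY= table by BY group (illustrative check only) */
proc sql;
    select account,
           count(*)    as n_rows,                    /* array length   */
           count(time) as n_time,                    /* non-missing TIME */
           min(date)   as first_date format=date9.,
           max(date)   as last_date  format=date9.
    from casuser.blogleads
    group by account;
quit;
```

Running the same query after the modified call makes the effect of the option changes easy to compare.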

Default settings on two options, LEAD and TRIMID, are changed in the syntax shown next to produce the desired table.

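/* LEAD=6 extends each array six intervals past its end */
proc tsmodel data=casuser.blogdata outarray=casuser.blogleads lead=6;
    /* TRIMID=BOTH lets each BY group keep its own start and end dates */
    id date interval=month trimid=both;
    by account;
    var y;
    outarrays time;
    submit;
        do i = 1 to dim(y);
            time[i] = i;
        end;
    endsubmit;
run;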

On the ID statement, the TRIMID value is changed from NONE (default) to BOTH. This allows each Y array to be a different length as defined by the dates corresponding to its first and last non-missing values. The LEAD option on the procedure statement is changed from 0 (default) to 6. This modifies the dimension of the Y arrays that flow into the SUBMIT block to be six intervals greater than their original length (defined by the TRIMID option).

Note that the +6 could be left in the loop definition and the desired table would still be produced. The request to extend TIME six intervals past the end of a given, now extended, Y array would be ignored, as it was in the first call to TSMODEL, because it would exceed the array dimension defined outside the SUBMIT block.

The modified BLOGLEADS table has the desired arrangement and is shown next.
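CAS tables have no guaranteed row order, so a listing is easier to read after sorting. The sketch below copies BLOGLEADS to a WORK data set (the name BLOGLEADS_SORTED is arbitrary) and prints it by account. Based on the description in this post, the last six rows of each account should show missing Y values while TIME continues to count.

```sas
/* Copy to WORK in a known order; the output name is an arbitrary choice */
proc sort data=casuser.blogleads out=work.blogleads_sorted;
    by account date;
run;

/* List each BY group separately to inspect the lead rows */
proc print data=work.blogleads_sorted noobs;
    by account;
    var date y time;
run;
```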

As a next step, Y and TIME would be assigned the roles of dependent and independent variables. We’ll show how this is done in the next post using the ATSM package in a TSMODEL-based forecasting project. The forecast horizon for each dependent variable array is defined to start on the interval after its last non-missing value, so the last six values of TIME, which correspond to the missing (future) values of Y, would be recognized as user-provided lead values for the input.

Bonus tip

Given the previous information, the following situation may be obvious. However, it may also provide further insight into how the LEAD option works and help you avoid unexpected results. Suppose that we want to add two more input variables with lead values, TIMESQ and the constant column ONE, to the existing BLOGLEADS table, and we submit the following.

/* note: the input table BLOGLEADS already contains six lead intervals */
proc tsmodel data=casuser.blogleads outarray=casuser.blogleads2 lead=6;
    id date interval=month trimid=both;
    by account;
    var y time;
    outarrays timesq one;
    submit;
        do i = 1 to dim(y);
            one[i] = 1;
            timesq[i] = one[i]*time[i]*time[i];
        end;
    endsubmit;
run;

This gives the following BLOGLEADS2 table (only ACCOUNT 1 values are shown).

The call to TSMODEL that generated the BLOGLEADS table added six intervals to the original dimension of the Y arrays. LEAD=6 on the second call added another six intervals, so each array now extends 12 intervals past the original data. Because BLOGLEADS already contains the six lead intervals, the appropriate value for LEAD on the second call, given we want six lead values of the features TIME, TIMESQ and ONE, is 0.
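Putting that into practice, a corrected second call might look like the sketch below. It is the same step with LEAD=0, on the assumption that BLOGLEADS already contains the six future intervals created by the earlier call.

```sas
/* LEAD=0: the input table already carries the six lead intervals */
proc tsmodel data=casuser.blogleads outarray=casuser.blogleads2 lead=0;
    id date interval=month trimid=both;
    by account;
    var y time;
    outarrays timesq one;
    submit;
        do i = 1 to dim(y);
            one[i] = 1;
            timesq[i] = one[i]*time[i]*time[i];
        end;
    endsubmit;
run;
```

Since 0 is the default for LEAD, omitting the option should give the same result; writing it out simply makes the intent visible.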

In a forecasting or extrapolation context, time-varying input variables come in two relevant types: deterministic and stochastic. For deterministic, time-varying inputs, like a customer’s age next year or the future date of a recurring holiday event, future values are known. Using their future values in dependent variable forecasts is just a matter of getting them into the data in the correct arrangement so that the software recognizes them for what they are. This is the process that this post has tried to explain.

For stochastic, time-varying input variables, like a customer’s credit score or the yield on US treasuries, future values are not known with certainty. In the stochastic case, there are a few different options for providing lead input values. In the next post in this sequence, we’ll assume that the analyst has chosen to use a model to predict the future values of the included input variables from their historical values, and the focus will be on the details of how this is implemented in Proc TSMODEL.
