Data Step for Timeseries: part 1, Overview

The purpose of this article is to provide an overview of concepts and SAS tools related to creating and processing timeseries as arrays. The usefulness of SAS timeseries array processing functionality, also known as the SAS data-step-for-timeseries toolbox, is illustrated in three demonstrations.

A timeseries is an indexed set of equally spaced values. Information can reside in both the order of the values and the distance between them, so sequences need to remain intact while timeseries data are created, explored and processed. A natural way to think about a timeseries in the context of data handling is as an array. An array provides a way to process a sequence of values based on an index and other user-provided attributes. Timeseries data handling based on the idea of array processing is featured in both SAS Viya and SAS 9, and we’ll generally refer to this functionality as the SAS data-step-for-timeseries toolbox.

The purpose of this series of articles is to introduce and explain the tools and to illustrate their usefulness through a series of examples. This article provides an overview of concepts on creating and processing timeseries as arrays. Subsequent articles are previewed here with three demonstrations. Article 2 will focus on timeseries BY group processing, where multiple timeseries arrays are defined and processed using BY group or sub-setting variables. Article 3 will focus on creating user-defined subroutines and functions and then using them in an array processing block of syntax. Topics covered in future articles will depend on reader feedback, so let us know what you think and provide suggestions for data-step-for-timeseries topics.

Demo 1, Initial and New Arrays

In the first example, we’ll use the AIR data set in the SASHELP library. This table contains two variables: a count of international airline passengers (in thousands), AIR, and a time index, DATE. The natural interval of the data is month and there are 144 observations. A portion of this table is shown.

It may be useful to think of the TIMEDATA Procedure (SAS/ETS) processing shown here in two steps. First, selected variables in the input data set are named and initial arrays are created. Second, new arrays are created by operating on elements of initial arrays defined in the first step.

The PROC TIMEDATA statement lists the input data set and two data sets that will contain the results of the processing. The OUT= table, WORK.AIR, will contain only the time ID and the initial arrays listed on the VARS statement. The OUTARRAY= table, WORK.AIRARRAY, contains the time ID, the initial arrays and any new arrays created in processing.
The ID statement names the time index variable from the input data set and lists the desired interval and accumulation method for array creation.
The VARS statement lists the initial arrays.
The OUTARRAYS statement names new arrays that will be created in subsequent processing.
The DO block contains the processing for new array creation.

Note that the ID and VARS statements combine to uniquely define the array, AIR. In this case, one observation on passenger count per quarter is derived by averaging the monthly observations in the input table. Then, the DO block syntax creates four new arrays by operating on elements of the array, AIR.

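proc timedata data=sashelp.air out=work.air outarray=work.airarray print=(arrays);
   /* Accumulate the monthly input to one averaged value per quarter */
   id date interval=qtr accumulate=average format=yymmdd.;
   vars air;
   /* New arrays that are filled in by the programming statements below */
   outarrays rw_trend lin_trend s4 c4;
   twopi = 2*constant("pi");
   do t = 1 to dim(air);
      /* The first quarter has no prior value, so RW_TREND[1] is left missing */
      if t > 1 then rw_trend[t] = air[t-1];
      lin_trend[t] = t;
      s4[t] = sin(twopi*t/4);
      c4[t] = cos(twopi*t/4);
   end;
   title "Create arrays for different trends and sinusoids";
run;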

Let’s investigate the results of the TIMEDATA call and discuss some details.

The AIR variable in the WORK.AIRARRAY table is a quarter interval timeseries. Its first value is the average of the first three values of (month interval) AIR in the input table. The DATE variable has a quarter interval as specified in the ID statement. Four new arrays are created. These are common timeseries model features.
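
To make the accumulation step concrete, here is a minimal sketch of the arithmetic in plain Python rather than SAS. It assumes the monthly series starts on a quarter boundary and contains only complete quarters; the function name accumulate_average and the sample values are hypothetical, and the TIMEDATA procedure itself aligns values by date rather than by position.

```python
def accumulate_average(values, per_bucket=3):
    """Average consecutive groups of `per_bucket` values (3 months -> 1 quarter)."""
    result = []
    for start in range(0, len(values), per_bucket):
        bucket = values[start:start + per_bucket]
        result.append(sum(bucket) / len(bucket))
    return result


# Hypothetical monthly counts covering two quarters (not the SASHELP.AIR values).
monthly = [10, 20, 30, 40, 50, 60]
quarterly = accumulate_average(monthly)
print(quarterly)  # [20.0, 50.0]
```

Each quarterly value is the mean of the three monthly values it covers, which mirrors what ACCUMULATE=AVERAGE does with INTERVAL=QTR on monthly input.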

RW_TREND: a random walk prediction for the next interval is the observed value for the current interval. The lagged (t-1) value of AIR is assigned to the current (t) value of RW_TREND in the DO block.
LIN_TREND: this array simply increments by one for each successive time interval. It could be useful as an input variable in a model to capture trend variation in a deterministic way.
S4 and C4: a sine and cosine pair that repeats every four intervals. These can be useful for capturing a seasonal pattern in quarter interval data.
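
The four features above can also be sketched in plain Python to show what each array holds. This is only an illustration of the calculations, not TIMEDATA syntax; the function name trend_features and the sample series are hypothetical, and None stands in for a SAS missing value.

```python
import math


def trend_features(series, period=4):
    """Build lag, linear trend, and sine/cosine feature lists for a series."""
    n = len(series)
    # Random walk feature: the previous value; the first element has no predecessor.
    rw_trend = [None] + list(series[:-1]) if n else []
    # Linear trend: 1, 2, ..., n (a 1-based time index, as in the DO loop).
    lin_trend = list(range(1, n + 1))
    # Sine and cosine pair that repeats every `period` intervals.
    s = [math.sin(2 * math.pi * t / period) for t in lin_trend]
    c = [math.cos(2 * math.pi * t / period) for t in lin_trend]
    return rw_trend, lin_trend, s, c


# Hypothetical quarterly values.
rw, lin, s4, c4 = trend_features([5, 7, 9, 11, 13])
print(rw)   # [None, 5, 7, 9, 11]
print(lin)  # [1, 2, 3, 4, 5]
```

With a period of four, the fifth value of each sinusoid matches the first (up to floating-point rounding), which is what lets the pair track a pattern that repeats once per year in quarterly data.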

Note that SAS functions, commonly found in DATA step syntax, are valid to use in the TIMEDATA procedure. TIMEDATA and other tools implementing the data-step-for-timeseries approach accommodate most of the SAS programming statements and SAS functions that you can use in a DATA step.

Demo 2, BY Group Processing

This example will feature SAS Viya. While this software framework is different from SAS 9, shown in the first demonstration, the approach is consistent and the data-step-for-timeseries tools are implemented in a similar way. In general, a table that will be used for BY group processing has sequences stacked on top of each other, and the data is sorted according to the BY variables and the time ID. In the simple table we’ll use here, there’s one sub-setting variable, and the data has been sorted by BY_GRP and then DATE. BY_GRP values identify two P_STATUS sequences. A subsequent article will illustrate how a table with more BY variables is organized.

Notes on the TSMODEL procedure (SAS Viya/Visual Forecasting) syntax:

BYGRP_IN is an in-memory table that lives in a CAS library, MYLIB. Three in-memory tables are produced. The OUTARRAY table contains the time ID, the initial arrays (here P_STATUS) and any new arrays created in processing. The OUTSCALAR table will contain system scalars defined in the processing. We’ll describe system scalars shortly. The OUTSUM table provides summary measures on initial and new arrays created in the TSMODEL procedure.
The ID and VAR statements combine to uniquely define initial arrays, here P_STATUS. Since the data flowing in already has a monthly interval, nothing was really changed in the process of creating the initial arrays in this example. The INTERVAL= and ACCUMULATE= options can be used to create initial arrays in other ways.
OUTARRAY lists new arrays to be created in subsequent processing, here LN_PSTATUS.
OUTSCALAR declares the system scalars to be created in subsequent processing, here SUM_SQ. A scalar is a single, system generated value associated with an array.
The BY statement identifies the sub-setting variable in the input table.
DATA step-like processing occurs in a SUBMIT block in the TSMODEL procedure. The DO loop runs from the first observation of each sequence to the last. A main point here is that each BY group’s values are operated on as a separate array. It may help to think of the DO loop as sequentially processing each BY group. Note, however, that in SAS Visual Forecasting BY groups are distributed and the processing occurs concurrently.
The values of each new array, LN_PSTATUS, are derived by log transforming the corresponding values of P_STATUS. The system scalars, SUM_SQ, are derived as an accumulating sum of squared values of LN_PSTATUS.

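proc tsmodel data=mylib.bygrp_in outarray=mylib.bygrp_out outscalar=mylib.scalars outsum=mylib.summary_stats;
   id date interval=month;
   var p_status;
   by by_grp;
   outarray ln_pstatus;
   outscalar sum_sq;
   submit;
      /* This block runs once per BY group, so the accumulator starts at zero for each group */
      sum_sq = 0;
      do t = 1 to dim(p_status);
         /* Log transform each value, then add its square to the running sum */
         ln_pstatus[t] = log(p_status[t]);
         sum_sq = sum_sq + ln_pstatus[t]**2;
      end;
   endsubmit;
run;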

Let’s investigate the results of the TSMODEL procedure call and discuss some details.

The MYLIB.BYGRP_OUT table contains the two P_STATUS timeseries (groups 1 and 2) defined by the ID, BY and VAR statements, and the two new LN_PSTATUS timeseries created in the SUBMIT block.

The MYLIB.SCALARS table contains the generated system scalars. One scalar is created for each LN_PSTATUS array.
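
The idea that each BY group is handled as its own array, with one scalar per group, can be sketched in plain Python. This is an illustration only, not TSMODEL syntax: the function name process_by_group and the sample rows are hypothetical, the rows are assumed to be sorted by group and then date, and the groups are processed one after another here rather than concurrently.

```python
import math
from itertools import groupby
from operator import itemgetter


def process_by_group(rows):
    """rows: (by_grp, date, p_status) tuples sorted by by_grp, then date."""
    ln_pstatus = {}  # one log-transformed array per BY group
    sum_sq = {}      # one scalar per BY group
    for grp, members in groupby(rows, key=itemgetter(0)):
        logs = [math.log(p_status) for _, _, p_status in members]
        ln_pstatus[grp] = logs
        sum_sq[grp] = sum(x ** 2 for x in logs)
    return ln_pstatus, sum_sq


# Hypothetical input: two groups with two monthly values each.
rows = [
    (1, "2020-01", 1.0),
    (1, "2020-02", math.e),
    (2, "2020-01", math.e),
    (2, "2020-02", math.e),
]
ln_pstatus, sum_sq = process_by_group(rows)
print(sum_sq)
```

Because the accumulator is built separately inside each group, a value from group 1 never leaks into the scalar for group 2.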

A portion of the columns of the MYLIB.SUMMARY_STATS table is shown. Summary statistics on each of the four timeseries are listed.

Demo 3, Creating and Calling a User Defined Subroutine

This example switches back to SAS 9. In the first part, a user-defined subroutine is created. The subroutine is then called in the TIMEDATA procedure to create a new array. The SAS Function Compiler procedure (FCMP, Base SAS) lets you create, test and store SAS functions, CALL routines and subroutines. Here, a subroutine named MYLEAD is created and then stored in a compile library that can be referenced in subsequent steps.

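/* Make the compiled routines in WORK.TIMEFNC available to later steps */
options cmplib=work.timefnc;

proc fcmp outlib=work.timefnc.funcs;
   subroutine mylead(actual[*], transform[*]);
      /* TRANSFORM is updated in place and returned to the caller */
      outargs transform;
      actlen = dim(actual);
      /* Shift each value one position earlier, stopping before the last element */
      do t = 1 to actlen - 1;
         transform[t] = actual[t+1];
      end;
      /* The last element has no following value, so set it to missing */
      transform[actlen] = .;
   endsub;
run;
quit;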

The ID and VARS statements combine to create the quarter interval array AIR, this time by totaling the monthly values in each quarter. The new array, LEADAIR, is declared in the OUTARRAYS statement and then created by calling the MYLEAD subroutine. The arguments AIR and LEADAIR correspond to ACTUAL and TRANSFORM in the definition of the subroutine.

proc timedata data=sashelp.air out=work.air2 outarray=airarray2 print=(arrays);
   id date interval=qtr accumulate=total format=yymmdd.;
   vars air;
   outarrays leadair;
   call mylead(air, leadair);
run;

A portion of the OUTARRAY table, AIRARRAY2, is shown.
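
For readers who want to check the expected shape of a lead array, here is a minimal sketch of the same shift in plain Python. It is an illustration only, not FCMP syntax: the function name my_lead and the sample values are hypothetical, and None stands in for a SAS missing value.

```python
def my_lead(actual):
    """Return a one-step lead of `actual`; the final element has no successor."""
    if not actual:
        return []
    # Drop the first value and pad the end so the length is unchanged.
    return list(actual[1:]) + [None]


# Hypothetical quarterly totals.
air = [100, 110, 125, 140]
leadair = my_lead(air)
print(leadair)  # [110, 125, 140, None]
```

Each position in the result holds the next interval’s value, and the last position is empty because nothing follows it.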

As you can see, the array approach provides a flexible and efficient way to handle timeseries data. In the next article, we’ll discuss BY group processing in more detail and provide more in-depth examples. Stay tuned for more data-step-for-timeseries action!

Find more articles from SAS Global Enablement and Learning here.
