SAS Visual Forecasting 8.1 – Data Manipulation with PROC TSMODEL

by Jack_Zhang on 06-12-2017 08:02 AM - edited on 07-11-2017 08:03 AM by Community Manager

SAS Visual Forecasting 8.1 is a new forecasting solution on SAS Viya™ that leverages the SAS Cloud Analytic Services (CAS) architecture for time series forecasting at large scale. PROC TSMODEL is a CAS-enabled procedure in SAS Visual Forecasting 8.1 that executes user-defined programs on time series data. It takes input data from CAS tables, processes the data on the CAS server, and stores the output in CAS tables. In this post, I’ll give an overview of the TSMODEL procedure from a data manipulation perspective. I’ll also describe the input data options, how the input data is used and processed, and the output data options, using some sample code.

 

An Overview of TSMODEL Procedure from a Data Manipulation Perspective

 

With timestamped transactional data loaded in a CAS table, you can use the TSMODEL procedure to accumulate the data into time series with a fixed time interval, as specified in your program statements. CAS represents these time series vectors as array variables and executes your program statements independently on the time series vectors of each BY group. From a data processing perspective, the TSMODEL procedure is similar to the SAS DATA step, but whereas the DATA step processes data row by row, the TSMODEL procedure processes time series vectors at the BY group level. The actual data processing in PROC TSMODEL runs on the CAS server. In a distributed CAS environment, the time series are delineated and partitioned based on the distinct values of the variables listed in the BY statement. The time series are processed in parallel and written out to CAS tables on each worker node. In addition, each worker node uses multiple threads to process concurrently the time series vectors loaded onto that node. Massively parallel processing within a distributed architecture is one of the key advantages of SAS Visual Forecasting 8.1 for large-scale time series forecasting. The following two diagrams describe how PROC TSMODEL processes data on a CAS server with multiple worker nodes.

 

picture1.png

 

When transactional data is loaded from a distributed data source, such as Hadoop or Teradata, the data associated with one particular time series is likely to be spread across multiple worker nodes. The TSMODEL procedure creates time series from the transactional input data and “shuffles” the resulting time series vectors across worker nodes so that each worker node receives the entire time series for a given BY group.

 

picture2.png
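
The shuffle described above can be pictured with a small Python sketch. Python is used here only for illustration; this is not SAS code, and the hash-based assignment in `assign_worker` is an assumption made for the example, not a description of how CAS actually partitions data. The point it shows is that once every record is routed by its BY-group key, each worker holds whole BY groups and can process them independently.

```python
import zlib
from collections import defaultdict


def assign_worker(by_key, n_workers):
    """Map a BY-group key (tuple of strings) to a worker index.

    A CRC32 of the key is used so the mapping is deterministic across runs
    (hypothetical routing rule, chosen only for this illustration).
    """
    encoded = "\x1f".join(by_key).encode("utf-8")
    return zlib.crc32(encoded) % n_workers


def shuffle(rows, by_vars, n_workers):
    """Route every row to the worker that owns its BY group."""
    workers = [defaultdict(list) for _ in range(n_workers)]
    for row in rows:
        key = tuple(row[v] for v in by_vars)
        workers[assign_worker(key, n_workers)][key].append(row)
    return workers


# Made-up transactional rows; two of them belong to the same BY group.
rows = [
    {"regionname": "Region1", "productname": "Product1", "date": "2017-01-03", "sale": 10.0},
    {"regionname": "Region2", "productname": "Product4", "date": "2017-01-05", "sale": 7.0},
    {"regionname": "Region1", "productname": "Product1", "date": "2017-02-11", "sale": 12.0},
]
workers = shuffle(rows, ["regionname", "productname"], n_workers=3)
```

After the call, both `Region1`/`Product1` rows sit on the same worker, whichever index the hash happens to select.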

 

You can further manipulate the time series vectors formed from the transactional input data by adding new components to them with user-defined programs, which run in parallel in the distributed CAS environment. The output time series data in a CAS table typically includes the following columns (variables):

  • BY variables: each distinct combination of BY variable values defines a time series
  • ID variable: the time ID
  • Numeric variables: each numeric variable (column) is treated as an array
  • Scalar variables: each scalar variable (column) has a single value for each distinct combination of BY variable values, and its data type is either numeric or character

The following statements are available in the TSMODEL procedure:

 

PROC TSMODEL options ;
    BY variables ;
    ID variable INTERVAL=interval < options > ;
    OUTARRAYS array-name-list ;
    OUTSCALARS scalar-name-list ;
    INSCALARS scalar-name-list ;
    VAR variable-list / options ;
    REQUIRE package-list ;
    SUBMIT < FILE=[SAS-file-ref |“File-path”] > < submit-options > ;
        Program Statements ;
    ENDSUBMIT ;

For experienced SAS users, the counterpart of PROC TSMODEL in SAS 9.4 is PROC TIMEDATA in SAS/ETS software. Although their syntax looks very similar, there are important differences between the two procedures. For example, PROC TSMODEL requires an ID statement, whereas the ID statement is optional in PROC TIMEDATA. In the current release, PROC TSMODEL does not support custom intervals, which are available in PROC TIMEDATA. To my knowledge, work is ongoing to provide the same features and functions in PROC TSMODEL as in PROC TIMEDATA, so custom intervals are likely to be supported by PROC TSMODEL in a future release. The most significant differences, however, concern data processing:

  • PROC TIMEDATA requires the input data to be sorted by the BY variables, while PROC TSMODEL does not require sorted input data;
  • Unlike PROC TIMEDATA, PROC TSMODEL does not process any time series data on the SAS client; the data is processed on the CAS server through a CAS session.

The TSMODEL procedure also resembles the TIMESERIES procedure in SAS/ETS in that both use program statements to perform time series analysis. However, the TIMESERIES procedure lets you apply a variety of standard time series analysis techniques through its various statements, while PROC TSMODEL provides no built-in time series analysis capabilities: you must define your own analysis with user-defined program statements in the SUBMIT/ENDSUBMIT block.

 

Primary Input and Output Table Options with PROC TSMODEL

 

The TSMODEL procedure requires you to specify CAS tables for all input data and all output data. Within PROC TSMODEL, you can automatically create time series data by accumulating numeric variables from the input table into the corresponding time intervals. You can also create new columns from the input data by using scripting language programs that are submitted to and executed on the CAS server. Let’s look at a simple code snippet.

 

/* this script illustrates the use of PROC TSMODEL and Scripting 
   Language to create time series data with additional columns 
   from timestamped data in a CAS table. */

/* create a CAS session and a CAS library */
cas mycas;
libname mycaslib cas sessref = mycas;
/* loading the pricedata table from SASHELP library to a CAS table */
data mycaslib.pricedata;
     set sashelp.pricedata;
run;
/* using PROC TSMODEL to create time series data in CAS tables */
proc tsmodel   data = mycaslib.pricedata
               out = mycaslib.timeseries
               outarray  = mycaslib.outarray
               outscalar = mycaslib.outscalar ;
     id date interval=month;
     var sale / acc = total;
     var price/ acc = average;
     by regionname productline productname;
     outarrays lnSale1 lnSale2 leadSale;
     outscalars avgSale;
     /* scripting language programming statements */
     submit;
            /*function to compute ln of the input scalar */
            function lnSaleFunc(x);
                     return(log(x));
            endsub;
            /*subroutine to compute ln of the input array*/
            subroutine lnSaleSub(x[*], y[*]);
                       outargs y;
                       do i = 1 to dim(x);
                          y[i] = log(x[i]);
                       end;
            endsub;
            /*get the lead of sale */
            do i = _LENGTH_ to 1 by -1;
               if i = _LENGTH_ then leadSale[i] = .;
               else leadSale[i] = sale[i+1];
            end;
            /*compute the average and taking the ln using the function*/
            avgSale = 0;
            do i = 1 to dim(sale);
               avgSale += sale[i];
               lnSale1[i] = lnSaleFunc(sale[i]);
            end;
            avgSale = avgSale/_LENGTH_;
            /*taking the ln using the subroutine*/
            call lnSaleSub(sale, lnSale2);
     endsubmit;
run;

The code above has one input table and generates three output tables as described below. The input table mycaslib.pricedata has three CHAR variables, one Time ID variable and a number of NUMERIC variables including sale, price, and discount. The first output table is mycaslib.timeseries, which contains the time series data created through TSMODEL statements. It contains only those variables that are specified in the ID, BY and VAR statements, where accumulation methods are specified for sale and price.

 

table1.png

 

The second output table is mycaslib.outarray, which contains the time series data created by TSMODEL statements and user-defined programs. It contains all variables that are specified in the ID, BY, VAR and OUTARRAYS statements. The three variables lnSale1, lnSale2, and leadSale declared in the OUTARRAYS statement are computed by the scripting language program in the SUBMIT/ENDSUBMIT block: lnSale1 with the function, lnSale2 with the subroutine, and leadSale with a DO loop. Note that the two predefined array variables _SEASON_ and _CYCLE_, as well as the _STATUS_ variable, are also included in this output table.

 

table2.png

 

The third output table is mycaslib.outscalar, which contains the scalars created by TSMODEL statements and the user-defined program. It contains the variables that are specified in the BY and OUTSCALARS statements. In this code snippet, the avgSale variable is declared in the OUTSCALARS statement, while its calculation is defined in the scripting language program in the SUBMIT/ENDSUBMIT block. Note that there is one scalar value for each distinct combination of BY variable values, and that the _STATUS_ variable is also included in this output table.

 

table3.png
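
To make the per-series arithmetic of the SUBMIT block concrete, here is a plain-Python sketch of the same three computations (lead, natural log, average) for a single BY group. It is an illustration only, not SAS code. The sample values are invented, `None` stands in for a SAS missing value, and the helper names are hypothetical. It also assumes the series has no missing values when the mean is taken.

```python
import math


def lead(series):
    """Shift the series one step earlier; the last element has no successor."""
    if not series:
        return []
    return series[1:] + [None]


def natural_log(series):
    """Element-wise natural log; non-positive or missing inputs yield None."""
    return [math.log(x) if x is not None and x > 0 else None for x in series]


def mean(series):
    """Arithmetic mean of a non-empty series with no missing values."""
    return sum(series) / len(series)


# Invented monthly totals for one BY group.
sale = [100.0, 120.0, 90.0]

lead_sale = lead(sale)       # counterpart of leadSale
ln_sale = natural_log(sale)  # counterpart of lnSale1 / lnSale2
avg_sale = mean(sale)        # counterpart of avgSale
```

The array results (`lead_sale`, `ln_sale`) have one element per time period, which is why they belong in the OUTARRAY= table, while `avg_sale` is a single number per BY group and belongs in the OUTSCALAR= table.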

 

The _STATUS_ variable that appears in the OUTARRAY=, OUTSCALAR=, and OUTSUM= tables contains a value that specifies whether the analysis has been successful or not. It can have the following values:

 

0       Analysis was successful.
3000    Accumulation failed.
4000    Missing value interpretation failed.
6000    Series is all missing.
9000    Descriptive statistics could not be computed.
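
Once an output table has been fetched to a client, the status codes above lend themselves to a simple lookup. The following Python sketch (an illustration, not SAS code) flags the BY groups whose `_STATUS_` is nonzero. The row layout, the `by_key` field, and the sample rows are assumptions made for the example.

```python
# Meanings of the _STATUS_ codes listed above.
STATUS_MEANINGS = {
    0: "Analysis was successful.",
    3000: "Accumulation failed.",
    4000: "Missing value interpretation failed.",
    6000: "Series is all missing.",
    9000: "Descriptive statistics could not be computed.",
}


def failed_series(rows):
    """Return (BY key, explanation) for every row with a nonzero _STATUS_."""
    return [
        (row["by_key"], STATUS_MEANINGS.get(row["_STATUS_"], "Unknown status code"))
        for row in rows
        if row["_STATUS_"] != 0
    ]


# Invented rows standing in for an OUTSCALAR= table pulled to the client.
outscalar_rows = [
    {"by_key": ("Region1", "Line1", "Product1"), "_STATUS_": 0},
    {"by_key": ("Region1", "Line1", "Product2"), "_STATUS_": 6000},
    {"by_key": ("Region2", "Line3", "Product9"), "_STATUS_": 3000},
]
problems = failed_series(outscalar_rows)
```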

The TSMODEL procedure creates predefined array variables, including _TIMEID_, _SEASON_, and _CYCLE_, and predefined scalar variables, including _FORMAT_, _INTERVAL_, _LEAD_, _LENGTH_, _SERIES_, _SEASONALITY_, and _THREAD_. For more information, see the TSMODEL procedure documentation for SAS Visual Forecasting 8.1.

 

It is important to remember that these predefined variable names must not be used as variable names in any input data set. However, they are available to be referenced in the scripting language code statements in your programs. For example, the predefined scalar variable _LENGTH_ is used in the code example above.

 

Advanced Input and Output Table Options with PROC TSMODEL

 

The TSMODEL procedure also has the following input and output options.

  • AUXDATA = <mycaslib.auxdata>

It specifies an auxiliary table that provides time series variables that are required for processing but are not included in the table that is specified in the DATA= option. You can specify multiple AUXDATA= options in the PROC TSMODEL statement. Each AUXDATA= option establishes an auxiliary table source to supply variables that are declared in subsequent statements in the procedure step. If no auxiliary data sources are required, the AUXDATA= option can be omitted. Variables referenced in the TSMODEL procedure can reside in either the primary table (specified in the DATA= option) or an auxiliary table (specified in an AUXDATA= option). The variables in the ID, BY and VAR statements of PROC TSMODEL are subject to different requirements in the primary table and in an auxiliary table.

 

table4.png

 

  • OUTSUM = mycaslib.outsum

It names the output table to contain the descriptive statistics. The descriptive statistics are based on the accumulated time series when the ACCUMULATE= option, the SETMISSING= option, or both are specified in the ID or VAR statements. This table is particularly useful when you want to analyze large numbers of series and you need a summary of the results.

 

table5.png

 

  • INSCALAR = <mycaslib.inscalar>

It specifies a table that supplies scalar variables to your program code, which can access them as it executes. For example, you can create a table with one row per BY group and one or more attributes that map the time series into groups based on the attribute values. You can then use this table as the INSCALAR= input table and dynamically assign different diagnostic specifications to certain groups of time series in your programs. For consistent results, prepare the input table so that only a single value is input for each BY group.
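
The "single value for each BY group" requirement is easy to check before a table is loaded as an INSCALAR= input. The Python sketch below (an illustration, not SAS code) builds a per-BY-group lookup and refuses tables that contain more than one row for the same BY group. The column names, the `segment` attribute, and the helper names are hypothetical.

```python
from collections import Counter


def duplicate_by_groups(rows, by_vars):
    """Return the BY keys that occur in more than one row."""
    counts = Counter(tuple(row[v] for v in by_vars) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def scalar_lookup(rows, by_vars, scalar_name):
    """Build {BY key: scalar value}, rejecting ambiguous input."""
    dupes = duplicate_by_groups(rows, by_vars)
    if dupes:
        raise ValueError(f"more than one row for BY groups: {dupes}")
    return {tuple(row[v] for v in by_vars): row[scalar_name] for row in rows}


# Invented attribute table: one segment label per BY group.
inscalar_rows = [
    {"regionname": "Region1", "productname": "Product1", "segment": "fast_moving"},
    {"regionname": "Region2", "productname": "Product4", "segment": "slow_moving"},
]
lookup = scalar_lookup(inscalar_rows, ["regionname", "productname"], "segment")
```

A program could then branch on the looked-up attribute, in the same way the paragraph above describes assigning different specifications to different groups of series.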

 

  • OUTOBJ = (outset=mycaslib.outest outspec=mycaslib.outspec)

It specifies two pairs, each of which binds a collector object to an output table. You can specify one or more object-table pairs as needed to associate the collector objects that you declare in your user-defined program with their output tables. You must specify a binding for every collector object that you declare in your program. Otherwise, a parse-time error is generated when you submit the program and no execution occurs.

 

  • INOBJ = (inest=mycaslib.outest inspec=mycaslib.outspec)

It specifies two pairs, each of which binds a repeater object to an input table. You can specify one or more object-table pairs as needed to associate the repeater objects that you declare in your user-defined program with their input tables. You must specify a binding for every repeater object that you declare in your program. Otherwise, a parse-time error is generated when you submit the program and no execution occurs. Collector and repeater objects are defined in various packages, which use PROC TSMODEL as the means of specifying the input data that each application requires. For more information about time series packages and objects, see the SAS Visual Forecasting 8.1 documentation.

 

Conclusion

 

The TSMODEL procedure provides a convenient way to create time series data from timestamped transactional data in a CAS environment. It enables you to write your own programs to manipulate time series data using the scripting language, for example to create new variables. All data processing within the procedure is executed in parallel through multi-threading on the distributed CAS worker nodes, which is highly efficient. PROC TSMODEL forms time series data from a primary input table (DATA= option) and auxiliary data sources (AUXDATA= options). The time series are then analyzed in the following steps, each of which runs only if the option or statements listed with it are specified:

  1. Accumulation: ACCUMULATE= option in the ID or VAR statement
  2. Missing value interpretation: SETMISSING= option in the ID or VAR statement
  3. Program execution: user-defined program statements
  4. Descriptive statistics: OUTSUM= option
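
The four steps can be traced for a single BY group with a short Python sketch. This is a conceptual illustration, not SAS code and not the procedure's implementation. It assumes monthly accumulation by total, a fill value of 0 for missing periods, a log transform as the stand-in user program, and an arbitrary handful of statistics; the statistics that OUTSUM= actually reports are not reproduced here. All function names and data are invented.

```python
import math
from collections import defaultdict
from datetime import date


def month_range(first, last):
    """Yield (year, month) pairs from first to last, inclusive."""
    year, month = first
    while (year, month) <= last:
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def accumulate_monthly_total(transactions):
    """Step 1: sum values into monthly buckets; empty months become None.

    Assumes at least one transaction.
    """
    totals = defaultdict(float)
    for day, value in transactions:
        totals[(day.year, day.month)] += value
    months = list(month_range(min(totals), max(totals)))
    return months, [totals.get(m) for m in months]


def set_missing(series, fill):
    """Step 2: replace missing values with the chosen fill value."""
    return [fill if x is None else x for x in series]


def user_program(series):
    """Step 3: stand-in user program (log of positive values, else None)."""
    return [math.log(x) if x > 0 else None for x in series]


def describe(series):
    """Step 4: a few descriptive statistics of the accumulated series."""
    return {
        "n": len(series),
        "sum": sum(series),
        "mean": sum(series) / len(series),
        "min": min(series),
        "max": max(series),
    }


# Invented transactions for one BY group; February has no records.
transactions = [
    (date(2017, 1, 3), 10.0),
    (date(2017, 1, 20), 5.0),
    (date(2017, 3, 8), 8.0),
]
months, totals = accumulate_monthly_total(transactions)
filled = set_missing(totals, 0.0)
logged = user_program(filled)
summary = describe(filled)
```

Note the ordering: the statistics in `summary` are computed on the accumulated and filled series, which mirrors the statement above that the descriptive statistics are based on the accumulated time series.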

The output of the TSMODEL procedure may include time series tables, collector objects, a summary table of descriptive statistics, and log tables. All outputs are saved as CAS tables. The procedure also produces a number of array variables and scalar variables that you can use in your programs. Once the desired time series data is created in a CAS table, you can use time series packages such as ATSM (Automatic Time Series Modeling) to model and forecast the time series. These capabilities will be demonstrated in my next post.

 

Relevant Blogs

 

SAS Visual Forecasting 8.1 – A New Scalable Efficient Flexible Forecasting Solution

  • Overview and background of SAS Visual Forecasting 8.1