Data Step for Timeseries: part 2, BY Group Processing

BY group processing was introduced in the context of data-step-for-timeseries in Part 1 of this series. A table that will be used for BY group processing has sequences stacked on top of each other, and each sequence is processed as a separate array. Even though each BY group or array is operated on independently, there can be a hierarchical arrangement in the data that’s defined by the BY groups. The purpose of this post is to present examples of BY group processing for timeseries, and the focus will be on how BY groups can be arranged to create nested tables of timeseries with a useful hierarchical structure.

Large-scale timeseries applications generally consume tables that are arranged hierarchically, so the demonstrations in this post use the large-scale tools in SAS Visual Forecasting. To start, consider a table that contains observations on sales, prices and promotions of wine over time. The BY variables are REGION (REG1 – REG4) and TYPE (VINTAGE, VALUE, TBLWT for table white, and TBLRE for table red). A portion of the table is shown below.

Sequences on sales, prices and promotions that flow into the data-step-for-timeseries tools are assumed to be transactional. The analyst’s job is to create the timeseries that will be used for analyses. Choices related to the time index (interval) and accumulation methods can add relevance and increase business value if the choices are consistent with the underlying patterns in the data and business practices. BY variables provide an additional way to add relevance and usefulness by adding structure to the analysis data.

For example, a two-level data hierarchy that represents brand or product type SALES flows can be created from the table shown above. It’s arranged as follows.

There are 4 SALES timeseries at the top level, one for each wine TYPE, and 16 SALES timeseries at the bottom level, one for each TYPE, REGION pair. This arrangement may be optimal if, for example, decisions about production, distribution, pricing and marketing are made on the basis of wine brands or types. The following code creates this two-level data hierarchy.

Demonstration 1

This code creates timeseries at the TYPE level of the data hierarchy. Plots of the four wine TYPE SALES arrays are generated. The variable listed on the BY statement defines the level of the data hierarchy that is being created.

proc tsmodel data = mylib.wineco outarray=mylib.typeseries
                    outsum=mylib.typesum;
   by type;
   id date interval=week;
   var sales /acc = sum;
   var baseprice promotion/acc = avg;
run;

proc sgplot data=mylib.typeseries;
   series x=date y=sales / group=type;
run;
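
The interval and accumulation choices discussed earlier are set on the ID and VAR statements, so they can be varied without touching the BY statement. As a minimal sketch, the call below builds the same TYPE-level series at a monthly interval. The output table name TYPEMONTHLY is hypothetical, and whether a monthly interval suits the data is an assumption made for illustration only.

```sas
/* Sketch: same input table and BY level as above, monthly interval.   */
/* The output table name (typemonthly) is illustrative, not from the   */
/* original post.                                                      */
proc tsmodel data=mylib.wineco outarray=mylib.typemonthly;
   by type;
   id date interval=month;             /* accumulate transactions to months */
   var sales / acc=sum;                /* total sales per month             */
   var baseprice promotion / acc=avg;  /* average price and promotion       */
run;
```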

The timeseries arrays at the TYPE, REGION level are created next by adding the REGION variable to the BY statement. The output data set names have been changed. The SALES timeseries arrays are then plotted. Note that the DATA step creates a combined group variable to make plotting the 16 series easier.

proc tsmodel data = mylib.wineco outarray=mylib.typeregseries
                    outsum=mylib.typeregsum;
   by type region;
   id date interval=week;
   var sales /acc = sum;
   var baseprice promotion/acc = avg;
run;

data plotin;
   set mylib.typeregseries;
   sep='_';
   grp = cats(type, sep, region);
run;

proc sgplot data=plotin;
   series x=date y=sales / group=grp;
run;
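
As an alternative view of the same output, the nesting can be made visible in the plot itself. The sketch below uses PROC SGPANEL to draw one panel per TYPE with one line per REGION, so no combined group variable is needed. The two-column layout is an arbitrary choice.

```sas
/* Sketch: one panel per TYPE, one line per REGION within each panel. */
proc sgpanel data=mylib.typeregseries;
   panelby type / columns=2;
   series x=date y=sales / group=region;
run;
```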

An important detail is that while each array in the data is operated on independently, the ID, VAR and BY statements combine to uniquely define the timeseries arrays that are created.
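
One quick way to see which arrays were created is to cross-tabulate the BY variables in the OUTARRAY table. In the sketch below, each nonempty cell corresponds to one timeseries, and the cell frequency is the number of weekly intervals in that series. It assumes the TYPEREGSERIES table from the preceding step exists.

```sas
/* Sketch: count the weekly intervals in each TYPE, REGION series. */
proc freq data=mylib.typeregseries;
   tables type*region / norow nocol nopercent;
run;
```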

Alternatively, assume that decisions related to production, distribution, pricing and marketing activities are made based on geographic or regional sales flows. The following hierarchical arrangement may be optimal in this scenario.

Demonstration 2

This code creates timeseries at the REGION level of the data hierarchy. A plot of the four regional SALES series at this level is generated. Note that the BY statement defines timeseries at the REGION level of the data hierarchy.

proc tsmodel data = mylib.wineco outarray=mylib.regseries
                    outsum=mylib.regsum;
   by region;
   id date interval=week;
   var sales /acc = sum;
   var baseprice promotion/acc = avg;
run;

proc sgplot data=mylib.regseries;
   series x=date y=sales / group=region lineattrs=(thickness=0.5);
run;

The timeseries at the REGION, TYPE level of the data hierarchy are created next by adding the TYPE variable to the BY statement. The output data set names have also been changed.

New feature creation (bonus!). In addition to the BY group processing, new arrays are created for each of the sixteen BY groups in this call to the TSMODEL procedure. EASTER and XMAS are binary arrays that may be useful as input variables to capture variation associated with recurring holiday events. EASTER is 1 for week intervals that contain Easter Sunday and 0 otherwise. XMAS is 1 for week intervals that contain 25DEC and 0 otherwise. A third array, YR, holds the calendar year of each interval and is used to look up the holiday dates. YEAR, WEEK and HOLIDAY are Base SAS functions. See Part 1 of this blog series for a discussion of creating new timeseries arrays with SUBMIT blocks in TSMODEL.

proc tsmodel data = mylib.wineco outarray=mylib.regtypeseries
                    outsum=mylib.regtypesum;
   by region type;
   id date interval=week;
   var sales /acc = sum;
   var baseprice promotion/acc = avg;
   outarrays easter xmas yr;
   submit;
      do t=1 to dim(sales);
         yr[t]=year(date[t]);
         EASTER[t] = (week(date[t])=week(holiday('EASTER',yr[t])));
         XMAS[t] = (week(date[t])=week(holiday('CHRISTMAS',yr[t])));
      end;
   endsubmit;
run;

data plotin;
   set mylib.regtypeseries;
   separator='_';
   grp = cats(region, separator, type);
run;

proc sgplot data=plotin;
   series x=date y=sales / group=grp lineattrs=(thickness=0.5);
run;
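
The week-matching logic in the SUBMIT block relies on the HOLIDAY and WEEK functions, which can be explored outside TSMODEL in an ordinary DATA step. The sketch below writes the holiday dates and their week numbers to the log for a few years. The year range is arbitrary and chosen only for illustration.

```sas
/* Sketch: inspect the holiday dates and week numbers used by the flags. */
data _null_;
   do yr = 2021 to 2023;                       /* illustrative years only */
      easter_date = holiday('EASTER', yr);     /* date of Easter Sunday   */
      xmas_date   = holiday('CHRISTMAS', yr);  /* 25DEC of the given year */
      easter_week = week(easter_date);         /* week number of the year */
      xmas_week   = week(xmas_date);
      put yr= easter_date= date9. easter_week= xmas_date= date9. xmas_week=;
   end;
run;
```

An interval is flagged when the week number of its date equals the week number of the holiday in the same year.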

A portion of the OUTARRAY table, REGTYPESERIES, that contains the new features is shown.
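
To look at the flagged intervals directly, the OUTARRAY table can be filtered on the new arrays. This is a minimal sketch that assumes the REGTYPESERIES table created above. The row limit is arbitrary.

```sas
/* Sketch: list a few intervals where either holiday flag is set. */
proc print data=mylib.regtypeseries(obs=20);
   where easter=1 or xmas=1;
   var region type date yr easter xmas;
   format date date9.;
run;
```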

We’ve shown how BY groups work in the context of processing timeseries as arrays, with a focus on how different arrangements of BY variables can be used to create and define arrays in different ways. While each array at each level of the data hierarchy is operated on independently, BY variables can be used to create useful, nested arrangements of data. A core idea is that the hierarchical data produced in BY group processing should be consistent with business practices and the underlying patterns in the data. This leads to increased relevance, value, and efficiency in subsequent modeling and post-processing steps.
