This blog focuses on the development and use of a custom Light Gradient Boosting Machine node within SAS® Visual Forecasting. Light Gradient Boosting Machine (LGBM) is a specialized, efficient implementation of the Gradient Boosting Machine (GBM) algorithm, designed to improve speed and reduce memory usage. While GBM is widely recognized for its accuracy in predictive modeling, it can be computationally intensive, particularly with large datasets. LGBM addresses these challenges by employing techniques such as histogram-based binning and exclusive feature bundling, which significantly reduce training time. In the context of time series forecasting, where large volumes of data and real-time predictions are common, LGBM’s efficiency makes it an ideal choice, enabling faster and more resource-efficient model training without sacrificing accuracy.
In the development of a custom node for LGBM in SAS® Visual Forecasting, the foundational work done by Yue Li, Jingrui Xie, and Iman Vasheghani Farahani is pivotal. Their paper, Writing a Gradient Boosting Model Node for SAS Visual Forecasting, laid the groundwork for writing a custom Gradient Boosting Machine node in SAS® Visual Forecasting. However, a challenge arises from the fact that PROC GRADBOOST, as discussed in their paper, differs significantly from PROC LIGHTGRADBOOST, particularly in its support for the CODE statement. The CODE statement writes SAS DATA step code for computing predicted values of the fitted model to a file or a table; to score new data, you can then include that file in a DATA step. The absence of the CODE statement in PROC LIGHTGRADBOOST presents a significant obstacle when generating predictions, especially when the lag of the dependent variable is greater than zero and dependent variable values must be generated recursively for future periods. This limitation necessitates reliance on PROC ASTORE to score the data, an approach that is common across many SAS machine learning and econometrics procedures.
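To make the contrast concrete, the following is a minimal sketch of the CODE statement workflow that PROC GRADBOOST supports but PROC LIGHTGRADBOOST does not. The table names, variable names, and file path are illustrative, not part of the custom node.

```
/* Hedged sketch: the CODE statement pattern available in PROC GRADBOOST.
   Table names (mycas.train, mycas.new_data) and the file path are
   hypothetical examples. */
proc gradboost data=mycas.train;
   input x1 x2 / level=interval;
   target y / level=interval;
   code file='/tmp/gbm_score.sas';   /* writes DATA step scoring code */
run;

/* Score new data by including the generated DATA step code */
data mycas.scored;
   set mycas.new_data;
   %include '/tmp/gbm_score.sas';
run;
```

Because PROC LIGHTGRADBOOST offers no such statement, the custom node instead saves an analytic store and scores through PROC ASTORE, as shown later in this post.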
The contributions of the custom LGBM node are threefold:
To define a pluggable modeling strategy in SAS® Visual Forecasting, three key files are required:
These files are typically bundled in a .zip file for easy upload to or download from The Exchange in SAS Visual Forecasting. Users can download an existing strategy, modify the files as needed, and then upload the adjusted strategy to ensure correct implementation. Details can be found in the paper Writing a Gradient Boosting Model Node for SAS Visual Forecasting.
The custom Light Gradient Boosting Machine (LGBM) node in SAS® Visual Forecasting provides a range of configurable options to tailor the fitting and forecasting process to specific needs. Each option is set through a macro variable, which controls a specific aspect of the LGBM model fitting and forecasting process based on the user's selections in the node configuration:
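As an illustrative sketch, a few of the macro variables that appear later in the node code might surface from the configuration like this (the values shown here are hypothetical defaults, not prescribed settings):

```
/* Hedged sketch: configuration selections surfaced as macro variables.
   The values are illustrative only. */
%let _task       = FIT;  /* task to run: FIT or FORECAST             */
%let _lagYNumber = 3;    /* number of lags of the dependent variable */
%let _lagXNumber = 2;    /* number of lags of independent variables  */
```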
In this section, we outline the SAS code that can be executed within SAS Visual Forecasting pipelines to perform time series forecasting with the custom Light Gradient Boosting Machine (LGBM) node. The code mainly comprises two macros, %lgbm_fit and %lgbm_forecast. The full code for the custom LGBM node can be found on GitHub for reference. Only the essential parts of the code are demonstrated here; we do not show code that overlaps with what was presented in the previous paper. The input data &vf_libIn.."&vf_inData"n of the custom node is prepared by the %fx_prepare_input macro. The following table describes the variables of the input table:
/* Train the Light Gradient Boosting Model */
proc lightgradboost
      data=&vf_libOut.."&vf_tableOutPrefix..fxInData"n(where=(_roleVar=1))
      validdata=&vf_libOut.."&vf_tableOutPrefix..fxInData"n(where=(_roleVar=2))
      seed=12345;
   input &vf_byVars / level=NOMINAL;
   %if "&vf_indepVars" ne "" %then %do;
      input &vf_indepVars / level=INTERVAL;
   %end;
   %if %intervalFeatureVarList ne %then %do;
      input %intervalFeatureVarList / level=INTERVAL;
   %end;
   %if %nominalFeatureVarList ne %then %do;
      input %nominalFeatureVarList / level=NOMINAL;
   %end;
   target &targetVar / level=interval;
   autotune maxtime=3600
      tuningparameters=(lasso(lb=0 ub=1 init=0));
   savestate rstore=&vf_libOut.."&vf_tableOutPrefix..lgbmStore"n;
run;

/* Promote the trained model table so it can be used by the forecasting task */
proc cas;
   table.promote / name="&vf_tableOutPrefix..lgbmStore" caslib="&vf_caslibOut";
quit;
The %lgbm_fit macro handles the process of fitting a Light Gradient Boosting Machine within the SAS Visual Forecasting environment. It begins by preparing the input data using the previously defined %fx_prepare_input macro. The macro then trains the LGBM model with the PROC LIGHTGRADBOOST procedure, applying the specified input variables and tuning parameters. The SAVESTATE statement creates an analytic store for the model and saves it as a binary object in a data table; you can use the analytic store in the ASTORE procedure to score new data. Finally, the macro promotes the trained model table for use in forecasting tasks and prepares the required output tables.
There are two scenarios when it comes to forecasting. In the first scenario, if there are no lagged values of the dependent variable in the model, forecasts can be generated in one step using PROC ASTORE. In the second scenario, when the model includes lagged values of the dependent variable, forecasts must be generated iteratively, accounting for the dependencies on previous time steps.
%if %eval(&_lagYNumber=0) %then %do;
   proc astore;
      score data=&vf_libOut.."&vf_tableOutPrefix..fxInData"n
            out=&vf_libOut.."&vf_tableOutPrefix..scored_lgb"n
            rstore=&vf_libOut.."&vf_tableOutPrefix..lgbmStore"n
            copyvars=(&vf_byVars &vf_timeID &targetVar &vf_depVar);
   run;
%end;
%if %eval(&_lagYNumber>0) %then %do;
   /* Score both the training and validation data */
   proc astore;
      score data=&vf_libOut.."&vf_tableOutPrefix..fxInData"n(where=(_roleVar in (1, 2)))
            out=&vf_libOut.."&vf_tableOutPrefix..scored_lgb_train"n
            rstore=&vf_libOut.."&vf_tableOutPrefix..lgbmStore"n
            copyvars=(&vf_byVars &vf_timeID &targetVar &vf_depVar &vf_indepVars %intervalFeatureVarList);
   run;

   /* Create the forecasting dataset */
   data &vf_libOut.."&vf_tableOutPrefix..fxInData_forecasting"n;
      set &vf_libOut.."&vf_tableOutPrefix..fxInData"n(where=(_roleVar=0));
      &predictVar = .;
   run;

   /* The number of iterations equals the number of forecast steps */
   %do i=1 %to %eval(&vf_lead);
      %let date_idx = %sysfunc(intnx(&vf_timeIDInterval, &vf_horizonStart, %eval(&i-1)));
      proc astore;
         score data=&vf_libOut.."&vf_tableOutPrefix..fxInData_forecasting"n(where=(&vf_timeID=&date_idx))
               out=&vf_libOut.."&vf_tableOutPrefix..score_next"n
               rstore=&vf_libOut.."&vf_tableOutPrefix..lgbmStore"n
               copyvars=(&vf_byVars &vf_timeID &targetVar &vf_depVar);
      run;

      /* Rename the prediction column in score_next */
      proc cas;
         table.alterTable / name="&vf_tableOutPrefix..score_next", caslib="&vf_caslibOut",
            columns={{name="&predictVar", rename="_inscalar_&predictVar"}};
      quit;

      /* Use PROC TSMODEL to forward-fill the lags of Y with predicted values */
      proc tsmodel data=&vf_libOut.."&vf_tableOutPrefix..fxInData_forecasting"n
                   inscalar=&vf_libOut.."&vf_tableOutPrefix..score_next"n
                   outarray=&vf_libOut.."&vf_tableOutPrefix..fxInData_forecasting"n;
         by &vf_byVars;
         id &vf_timeID interval=&vf_timeIDInterval;
         var &targetVar &predictVar &vf_depVar &vf_indepVars %intervalFeatureVarList;
         inscalars _inscalar_&predictVar;
         submit;
            &predictVar.[&i.] = _inscalar_&predictVar;
            %do j=1 %to &_lagYNumber;
               %if (&i+1-&j > 0) %then %do;
                  if _lagY&j[&i+1] = . then do;
                     _lagY&j[&i+1] = &predictVar[&i+1-&j];
                  end;
               %end;
            %end;
         endsubmit;
      run;
   %end;

   /* Concatenate the train table and the forecasting table */
   data &vf_libOut.."&vf_tableOutPrefix..scored_lgb"n;
      set &vf_libOut.."&vf_tableOutPrefix..scored_lgb_train"n
          &vf_libOut.."&vf_tableOutPrefix..fxInData_forecasting"n;
   run;
%end;
In the first part of the code (the %if block for &_lagYNumber=0), PROC ASTORE directly generates forecasts by scoring the dataset in one step.
In the second part (the %if block for &_lagYNumber>0), when the model contains lagged values of the dependent variable, the code generates forecasts iteratively. First, the training and validation data are scored and saved as scored_lgb_train. Then, a separate forecasting dataset is created in which the prediction variable (&predictVar) is set to missing. Note that there is no need to extend the independent variables, as they are already fully prepared and extended when the input data is initialized. Therefore, the primary focus of the iterative process is on extending the dependent variable.
The iterative process loops over the forecast horizon (&vf_lead), generating forecasts one step at a time. After each step, the predicted value is written back into the dataset using PROC TSMODEL to update the lags of the dependent variable, ensuring that the model accounts for the lagged structure in subsequent iterations. The results from each iteration are processed, and the prediction column is renamed for clarity. After all iterations are completed, the training data and forecasting data are concatenated. It is worth noting that using PROC ASTORE to score all BY groups in parallel at a given time step speeds up the forecasting process. Likewise, PROC TSMODEL fills in the predicted values for the next iteration efficiently, handling the forward-filling in parallel across BY groups and further enhancing overall performance.
/* First, check whether the trained LGBM model table is available */
%if %sysfunc(exist(&vf_libOut.."&vf_tableOutPrefix..lgbmStore"n)) %then %do;
   %put The trained model table exists.;
%end;
%else %do;
   %put ERROR: You must run the FIT task to create the data that the FORECAST task needs.;
   %abort cancel;
%end;

/* Then, check whether the lag specifications are consistent */
data _null_;
   set &vf_libOut.."&vf_tableOutPrefix..lagTableFit"n;
   call symputx('_fitLagYNumber', _lagYNumber); /* Create macro variable from _lagYNumber */
   call symputx('_fitLagXNumber', _lagXNumber); /* Create macro variable from _lagXNumber */
run;

proc cas;
   errorStatus = 0;
   lagYNumberFit = &_fitLagYNumber;
   lagXNumberFit = &_fitLagXNumber;
   /* Compare the current values with the specifications from the fitting process */
   if lagYNumberFit ne &_lagYNumber or lagXNumberFit ne &_lagXNumber then do;
      /* Cast the numbers to strings for the error message */
      print (error) 'The model was fitted with _lagYNumber = ' || (string) lagYNumberFit ||
            ' and _lagXNumber = ' || (string) lagXNumberFit || '.';
      print (error) 'Please change the values to match these specifications.';
      errorStatus = 1;
   end;
   symputx('preProcessErrorStatus', errorStatus);
quit;

/* Abort if there is an error */
%if &preProcessErrorStatus ne 0 %then %abort;
The %lgbm_forecast macro automates the forecasting process using a pre-trained Light Gradient Boosting Machine (LGBM) model within SAS Visual Forecasting. It first checks whether the trained LGBM model table (lgbmStore) exists; if not, the macro halts and prompts the user to run the FIT task. Next, it ensures that the lag configuration of the forecasting process matches the lag settings (stored in &vf_libOut.."&vf_tableOutPrefix..lagTableFit"n) used during the fitting process. If there is a discrepancy, it prints an error message and stops execution. The forecasting code remains largely the same as previously discussed, with the key difference being the reuse of the pre-trained model, eliminating the need to train a new one.
/* Run %lgbm_fit or %lgbm_forecast according to the selected task (FIT/FORECAST) */
%if %upcase(&_task) eq FIT %then %do;
   %lgbm_fit;
   %lgbm_forecast;
%end;
%if %upcase(&_task) eq FORECAST %then %do;
   %lgbm_forecast;
%end;
This SAS code block conditionally executes the %lgbm_fit and %lgbm_forecast macros based on the value of the &_task macro variable. If &_task is set to FIT, the %lgbm_fit macro is called to train the model, followed by %lgbm_forecast to generate forecasts. If &_task is set to FORECAST, only the %lgbm_forecast macro is called, generating forecasts with the pre-trained model.
In conclusion, the development of the custom Light Gradient Boosting Machine (LGBM) node within SAS® Visual Forecasting represents a significant advancement in the application of gradient boosting techniques for time series forecasting. By building on previous work on a custom node for GBM, we have successfully adapted LGBM for use in this context, overcoming the challenges posed by the absence of the CODE statement in PROC LIGHTGRADBOOST. The reliance on PROC ASTORE for scoring not only addresses these challenges but also establishes a versatile framework that can be extended to other SAS procedures. Additionally, the use of PROC TSMODEL has proven efficient in filling in the lags of predicted values for subsequent forecasting iterations, further streamlining the process. The custom node also includes an option to reuse the trained model and predictions from the fitting task, minimizing the need for redundant calculations and making the process more efficient. The ability to separate the fitting and forecasting processes further adds to its utility in real-world applications. Ultimately, this work adds a valuable tool for time series analysis in SAS® Visual Forecasting.