Multivariate adaptive regression splines (MARS; Friedman, 1991) is a nonparametric technique that combines regression splines with model selection methods. It is a powerful predictive modeling tool because (1) it extends linear models to capture nonlinear dependencies, and (2) it produces parsimonious models that tend not to overfit the data and therefore predict well. MARS constructs spline basis functions adaptively by automatically selecting appropriate knot values for different variables. This helps E-miners identify which variables have linear or nonlinear effects, as well as the interactions among them. When higher-order terms are excluded, MARS is well suited to identifying the effects of single variables in a multivariate setting, which makes it highly usable in process control and in identifying experimental designs. MARS is also used in forecasting as a variable screening tool.
MARS has long been a sought-after tool among our E-miners, and you can now add it to Enterprise Miner as an extension node by following a few simple steps.
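To make the idea of knots and basis functions concrete, here is a minimal Python sketch of the hinge functions that MARS models are built from (Friedman, 1991). It is an illustration only, not the code that the Enterprise Miner node runs; the function names, the knot at 5.0, and the coefficients are made up for the example.

```python
def hinge(x, knot, direction=1):
    """Hinge basis function.

    direction=1 gives max(0, x - knot); direction=-1 gives max(0, knot - x).
    """
    return max(0.0, direction * (x - knot))


def predict(x, intercept, terms):
    """Evaluate an additive MARS-style model at a single value x.

    terms is a list of (coefficient, knot, direction) tuples, one per
    hinge basis function.
    """
    return intercept + sum(
        coef * hinge(x, knot, direction) for coef, knot, direction in terms
    )


# Hypothetical model: flat up to the knot at 5.0, then rising with slope 2.
terms = [(2.0, 5.0, 1)]
for x in (3.0, 5.0, 7.0):
    print(x, predict(x, intercept=1.0, terms=terms))
```

Because each hinge is zero on one side of its knot, a sum of hinges is piecewise linear. That is how the method captures nonlinear effects while keeping each term easy to interpret.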
Once deployed, you can find the MARS node under the Applications tab as shown in Figure 1.
Figure 1: MARS node under the Applications tab on the toolbar
MARS Node Requirements
One or more input variables are required for the MARS node. The data set can contain at most one target variable, either interval or categorical.
If the input data set contains a frequency variable, that variable must be an interval variable and all of its values must be positive integers.
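These requirements can be checked before the data reaches the node. The following Python sketch shows one way to do so outside Enterprise Miner. The helper names, the role labels, and the sample values are hypothetical, and the sketch is not how the node itself validates its input.

```python
def is_positive_integer(value):
    """True for 1, 2, 3.0, ...; False for 0, -1, 2.5, NaN, strings, bools."""
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return value > 0
    if isinstance(value, float):
        return value.is_integer() and value > 0
    return False


def check_mars_inputs(roles, freq_values=None):
    """Return a list of problems found; an empty list means the checks pass.

    roles maps variable name -> role ("input", "target", "freq", ...).
    freq_values holds the frequency variable's values, if there is one.
    """
    problems = []
    role_list = list(roles.values())
    if role_list.count("input") < 1:
        problems.append("at least one input variable is required")
    if role_list.count("target") > 1:
        problems.append("at most one target variable is allowed")
    if freq_values is not None and not all(
        is_positive_integer(v) for v in freq_values
    ):
        problems.append("frequency values must be positive integers")
    return problems


# Hypothetical variable names and frequency values for illustration.
roles = {"x1": "input", "x2": "input", "y": "target", "n": "freq"}
print(check_mars_inputs(roles, freq_values=[1, 2, 3.0]))
print(check_mars_inputs(roles, freq_values=[1, 0, 2.5]))
```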
MARS Node Properties
Drag a MARS node onto an open diagram, and you will see the property panel as shown in Figure 2.
Figure 2: MARS node properties panel
Here are descriptions of the main properties.
MARS Node Example
This example uses the sample SAS data set SAMPSIO.HMEQ. You must first create a SAS Enterprise Miner data source from this data set. Right-click the Data Sources folder in the Project Navigator and select Create Data Source to launch the Data Source wizard.
Drag the HMEQ data set and the MARS node to your diagram workspace. Connect them as shown in the diagram below.
Select the button next to the Keep Effects property to open a term editor. Specify that the variable Job be included in the final model, as shown in the diagram below, and then click OK.
Leave the other settings at their defaults and run the MARS node by right-clicking it and selecting Run. In the Confirmation window, select Yes. After the MARS node runs successfully, select Results in the Run Status window.
Notice the following information:
Bases Transformation Information is a table of the transformations that are used to generate the basis matrix. The first basis function, Basis0, is the intercept. The second basis function, Basis1, is 1 when the variable Job has the level ‘Sales’ and 0 otherwise. The eleventh basis function, Basis11, is Loan - 40800 when Loan > 40800 and 0 otherwise; here 40800 is a knot value. The other basis functions are constructed in a similar manner from other knot values. The knots are chosen automatically.
Parameter Estimates is a table of parameter estimates and the selected variables.
Backward Selection Iteration is a plot that displays the progression of the backward elimination phase. The generalized cross-validation (GCV) criterion estimates how well a model will perform on new data, so a final model chosen by this criterion should have good predictive power. The figure below shows that the backward elimination step eliminates basis functions 13, 10, and 11.
ANOVA is an analysis of variance table for the target variable.
Classification Variables is a table of level information for the classification variables.
Fit Control Parameters is a table of the parameters that control spline fitting.
Fit Statistics is a table of the fit statistics from the model.
Model Information is a table of MARS model settings.
Variable Importance is a table of input variables, scaled by their relative importance as predictors for the target variable.
Dependent Variable vs. Fitted Values is a plot that displays the raw dependent variable overlaid with the fitted values. This plot is not produced for a dependent variable with a nonnormal distribution.
Residuals vs. Fitted Values is a plot that displays the residuals overlaid with the fitted values. This plot is not produced for a dependent variable with a nonnormal distribution.
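Two of the results above lend themselves to a small worked example: the basis functions listed in Bases Transformation Information and the GCV (generalized cross-validation) criterion that drives backward elimination. The Python sketch below hand-codes Basis1 and Basis11 as described above and computes GCV in the general form given by Friedman (1991), GCV = (RSS / n) / (1 - C / n)^2. How the node counts the effective number of parameters C is not stated in this article, so C is passed in as an input. The function names and the numbers in the usage lines are assumptions for illustration.

```python
def basis1(job):
    """Indicator basis: 1 when Job has level 'Sales', 0 otherwise."""
    return 1.0 if job == "Sales" else 0.0


def basis11(loan, knot=40800.0):
    """Hinge basis: Loan - 40800 when Loan > 40800, 0 otherwise."""
    return max(0.0, loan - knot)


def gcv(rss, n, effective_params):
    """Generalized cross-validation: (RSS / n) / (1 - C / n) ** 2.

    rss is the residual sum of squares, n the number of observations,
    and effective_params (C) the effective number of parameters, which
    penalizes models that use more basis functions.
    """
    if not 0 <= effective_params < n:
        raise ValueError("effective_params must be in [0, n)")
    return (rss / n) / (1.0 - effective_params / n) ** 2


print(basis1("Sales"), basis1("Other"))
print(basis11(50000.0), basis11(30000.0))

# Made-up numbers: a smaller model with a slightly higher RSS can still
# have the lower GCV, which is why backward elimination drops terms.
print(gcv(rss=100.0, n=50, effective_params=10))
print(gcv(rss=104.0, n=50, effective_params=5))
```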
Note: Special thanks to Paal Navestad, Senior Data Scientist at ConocoPhillips, for providing valuable feedback on this article.