Improving model performance with data scaling


Data scaling (transforming variables so they fit within a specific range or distribution) can improve the performance and interpretability of machine learning and statistical models. Several methods of scaling exist, including standardization, normalization, and midrange scaling. Many supervised and unsupervised machine learning models benefit greatly from variable scaling.  Explanatory modelers can scale predictors to facilitate comparisons.  For predictive modelers, the scaling methods that most improve model fit and accuracy are data dependent.  In this post I'll describe several methods of scaling, and I'll demonstrate how to empirically assess various scaling methods applied to a logistic regression model with LASSO selection in SAS Viya.

 

Scaling for machine learning models

 

Many data sets contain variables that vary greatly in their units and range.  For example, when trying to improve the yield of a chemical process, the temperature of the reaction may be in the hundreds while the pressure of that reaction might be in the range of 1 to 2 atmospheres.  Several machine learning models are sensitive to differently scaled inputs, and for these models, transforming variables to a similar scale is crucial for model fit and predictive accuracy.  Differently scaled variables can result in slow model training, apparent collinearity caused by scale differences, and imprecise parameter estimates.  Often, the inputs with the greatest range of values will have the largest impact on the model, particularly in models based on distance calculations.  Ensuring the inputs are on a similar scale avoids many of these problems.

 

Scaling for explanatory modeling

 

Scaling can also benefit model interpretability.  Researchers frequently want to compare the effects of different predictors on a response variable Y.  How should this be done when the predictors have different units?  One approach is standardization, which converts predictor values into units of standard deviations from the mean.  When standardized variables are used in regression, each partial regression coefficient describes the average change in Y for a one standard deviation (SD) change in the predictor.  This is not only true for linear regression: in Poisson and logistic regression on standardized predictors, coefficients represent the average effect on the log counts or on the logit, respectively, per 1 SD change in X.  If the research goal is to determine which regressor has the greatest impact on Y, one way to answer the question is to report the variable with the largest standardized regression coefficient in magnitude.

 

Additionally, standardized variables are centered at zero. In a regression on unscaled variables, the intercept is often uninteresting, either because it represents an unrealistic value (e.g., X=0 never occurs) or because it lies outside the range of the data.  When variables are standardized, the intercept refers to the mean value of Y when X equals its mean (x̄) rather than when X = 0. A regression of, say, human height on weight then has an intercept that describes the average height for someone at the mean weight instead of the average height of someone who weighs zero pounds.
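To make both points concrete, here is a minimal sketch, assuming a hypothetical data set work.people with variables height and weight. PROC STDIZE standardizes the predictor, and the regression is then refit on the standardized variable. (Alternatively, the STB option on the MODEL statement of PROC REG prints standardized coefficients without transforming the data.)

/* a minimal sketch; work.people with height and weight is hypothetical */
proc stdize data=work.people out=work.people_std method=std;
	var weight;	/* weight is now centered at 0, in SD units */
run;

proc reg data=work.people_std;
	model height = weight;	/* intercept = mean height at the mean weight;
				   slope = change in height per 1 SD of weight */
run;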

 

Explanatory modelers may scale measures of variability as well.  The coefficient of variation (CV) is a mean-scaled standard deviation.  This unitless measure of variability allows comparison of variability across variables as different as automobile accidents at city intersections and the number of petals on a typical sunflower.
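For example, the CV keyword in PROC MEANS computes this directly. The short sketch below compares the variability of two differently scaled variables in the shipped SASHELP.CARS table:

/* compare variability of dollars (MSRP) and pounds (Weight) on a unitless scale */
proc means data=sashelp.cars mean std cv;
	var msrp weight;
run;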

 

For the rest of this post, I’ll focus on scaling for machine learning and predictive models.

 

Methods of variable scaling

 

Three common methods of variable scaling are standardization, normalization, and midrange scaling. Standardization subtracts the mean from each value and divides by the standard deviation of the variable. It is often most useful for variables that are normally distributed. Standardization typically compresses normally distributed variables to values ranging between −3 and +3. Extreme outliers can still fall well beyond this range, so if a model is very sensitive to outliers, another scaling method may be a better choice.

 

z = (x − x̄) / s, where x̄ is the variable's mean and s is its standard deviation

 

Normalization can have several meanings; the most common in machine learning modeling is to scale variables to a minimum value of 0 and a maximum value of 1. This works well for data that do not follow a normal distribution. It results in all features being scaled to the same range, ensuring fair treatment of each predictor in machine learning models, which can be important for models that are sensitive to the magnitude of the input data. Note that despite the name, this type of scaling does not make a variable follow the normal (aka Gaussian) distribution.  Normalization subtracts the minimum value from each observation and divides by the range (the maximum value minus the minimum value).

 

x′ = (x − xmin) / (xmax − xmin)

 

Midrange scaling is a variation of normalization that scales variables to a minimum of −1 and a maximum of +1. It is often used for neural network models; for example, the SAS Viya NNET procedure defaults to midrange scaling of all inputs before fitting a neural network.  Why is this preferred to normalization?  A commonly used activation function in neural networks is the hyperbolic tangent, tanh, which transforms the output of the hidden units (derived nonlinear functions of the predictors) to the range (−1, +1).  Midrange scaling aligns the inputs more closely with the output of this function, which can facilitate estimation of the neural network weights.  Midrange scaling subtracts the midrange from each value, then divides by half the range. The midrange is half the sum of the maximum and minimum values.

 

x′ = (x − midrange) / (range / 2), where midrange = (xmin + xmax)/2 and range = xmax − xmin
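To make the arithmetic concrete, here is a minimal DATA step sketch that applies all three formulas to a single variable x. The input table work.raw and the summary statistics are hypothetical placeholders; in practice the statistics would come from the training data (for example, from PROC MEANS or from PROC STDIZE's OUTSTAT= table).

/* minimal sketch; work.raw and the statistics below are hypothetical */
data work.scaled;
	set work.raw;
	xbar = 50; s = 10; xmin = 20; xmax = 80;	/* assumed training statistics */
	z    = (x - xbar) / s;				/* standardization */
	norm = (x - xmin) / (xmax - xmin);		/* normalization */
	mid  = (x - (xmin + xmax)/2) / ((xmax - xmin)/2);	/* midrange scaling */
run;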

 

Which models are sensitive to variable scaling?

 

Principal component analysis – PCA is typically performed on scaled variables (a correlation matrix) rather than unscaled data (a covariance matrix).  If covariances are used and inputs have greatly different scales, the principal components tend to be dominated by the variables with the largest ranges and are less useful for data summarization and data reduction.
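For example, the SAS/STAT PRINCOMP procedure analyzes the correlation matrix by default, and the COV option switches the analysis to the covariance matrix. In the quick sketch below, using the shipped SASHELP.CARS table, the covariance-based analysis would be dominated by MSRP because of its large variance:

/* default: PCA on the correlation matrix (scale-free) */
proc princomp data=sashelp.cars out=work.pca_corr;
	var msrp invoice horsepower weight;
run;

/* COV option: PCA on the covariance matrix (scale-sensitive) */
proc princomp data=sashelp.cars cov out=work.pca_cov;
	var msrp invoice horsepower weight;
run;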

 

Neural networks – Neural networks (NNs) have many parameters, so many local error minima likely exist.  Putting the variables on the same scale helps speed the optimization and reduces the chance that a merely locally optimal set of weights is found.  Using unscaled variables can cause the NN to take many more iterations to converge, or sometimes to fail to converge entirely.  In mathematical terms, the optimization process follows gradients that indicate the direction of steepest error reduction; with unscaled data, training may be slower because the gradients are dominated by the inputs with the largest scales.

 

K-means clustering, k-nearest neighbors, and support vector machines – What these models have in common is that they involve calculating distances between data points to determine similarity.  Distance-based algorithms are sensitive to variable units, so it is common to scale data so that all the predictors can contribute equally to the result.
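A common pattern is to scale first and then cluster. The sketch below assumes a hypothetical table work.customers with two inputs on very different scales and uses the SAS/STAT FASTCLUS procedure for k-means clustering:

/* scale so income (dollars) and age (years) contribute equally to distances */
proc stdize data=work.customers out=work.customers_std method=std;
	var income age;
run;

proc fastclus data=work.customers_std maxclusters=3 out=work.clusters;
	var income age;
run;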

 

Generalized linear models and mixed models – This includes logistic regression, Poisson regression, random coefficient models, and many others.  What these all have in common is that they use maximum likelihood estimation.  While scaling is not required for maximum likelihood estimation, this iterative process often proceeds more smoothly when variables have a similar range. Particularly for complex models, scaling can fix convergence problems by making the optimization more stable: large scale differences can cause the algorithm to take erratic steps on its way to the optimum.

 

Which models do not require variable scaling?

 

Tree-based models – Decision trees, forests, and gradient boosting don't require variable scaling.  These models use hierarchical splitting; they are not based on distance calculations and do not use maximum likelihood estimation.  In general, tree-based analyses are also insensitive to outliers and do not require variable transformations.

 

An example of variable scaling for logistic regression using LASSO selection

 

Next, I'll demonstrate the effect of different methods of scaling on the outcome of a logistic regression model with LASSO selection.  I analyzed a banking data set using various types of scaling, with the goal of predicting which customers are likely to purchase an insurance product. These data are a modified version of the develop.sas7bdat data, available in SAS Viya for Learners.  The target variable INS is binary, indicating 1 for a purchase and 0 for no purchase, and there were initially 25 inputs. The inputs were relatively uncorrelated because they were chosen from a larger pool of candidate predictors through unsupervised variable selection with the SAS Viya VARREDUCE procedure (run through the Variable Reduction task in SAS Studio). I then used LASSO selection to find the logistic regression model with the lowest validation misclassification rate.

 

Least absolute shrinkage and selection operator (LASSO)

 

LASSO uses a version of least squares estimation that minimizes the error plus a penalty proportional to the sum of the absolute values of the regression coefficients.  This shrinks the slopes, in some cases all the way to zero, which makes the corresponding predictors drop out of the model.  Shrinkage also makes the model less sensitive to the training data and reduces the chance of overfitting.  The idea, rooted in the bias-variance trade-off, is that LASSO adds a small amount of bias to the predictions by shrinking coefficients away from their unbiased estimates in exchange for a larger reduction in prediction variance.  This balancing of bias and variance ideally increases the model's overall predictive accuracy.  For a general review of the bias-variance trade-off, see my previous post Big Ideas in Machine Learning Modeling: The Bias-Variance Trade-Off.  For more information on the bias-variance trade-off in relation to LASSO, see the introductory SAS data science course "Statistics You Need to Know for Machine Learning".
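In equation form, for linear regression LASSO chooses the coefficients that minimize

SSE + λ(|β1| + |β2| + … + |βp|)

where SSE is the sum of squared errors and λ ≥ 0 controls the amount of shrinkage: the larger the value of λ, the more coefficients are driven exactly to zero. (For logistic regression, the negative log-likelihood plays the role of SSE.)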

 

Variable scaling using PROC STDIZE

 

I started by using PROC STDIZE to scale the data sets develop_train and develop_valid.  The METHOD=STD, METHOD=RANGE, and METHOD=MIDRANGE options perform standardization, normalization, and midrange scaling, respectively.  Note that scaling the training data with OUTSTAT=train_std and the validation data with METHOD=IN(train_std) adjusts the validation data using the location and scale statistics (e.g., means and standard deviations) computed from the training data, not from the validation data itself.  This prevents information leakage into the validation assessment statistics.  For information on information leakage in the context of imputing missing values, please see my previous post How to Avoid Data Leakage When Imputing for Predictive Modeling.

 

The training data and validation data were then combined because PROC LOGSELECT, like many other SAS Viya supervised modeling procedures, looks for training and validation data within the same data set.

 

Here is the code:

 

%let varreducelist=BIN_DDABal branch_swoe IM_CCBal DDA MMBal ILS MTG Checks NSF IM_LORes LOCBal CDBal log_IM_Income 
IRA IM_CRScore Sav ATMAmt Moved CashBk IM_AcctAge SDB IM_POS SavBal InArea IM_Age IM_CC Teller IRABal DirDep DepAmt 
IM_HMVal CD ATM MMCred LOC;

/* standardize training inputs for LASSO */
proc stdize data=mycas.develop_train 
	out=mycas.develop_train_std method=std outstat=train_std;
	var &varreducelist;
run;

/* standardize validation inputs using training means and training std */
proc stdize data=mycas.develop_valid 
	out=mycas.develop_valid_std method=in(train_std);
	var &varreducelist;
run;

/* combine training and validation */
data mycas.develop_final_std;
	set mycas.develop_train_std mycas.develop_valid_std;
run;

/* normalize training inputs for LASSO */
proc stdize data=mycas.develop_train 
	out=mycas.develop_train_norm method=range outstat=train_norm;
	var &varreducelist;
run;

/* normalize validation inputs using training adjustments */
proc stdize data=mycas.develop_valid 
	out=mycas.develop_valid_norm method=in(train_norm);
	var &varreducelist;
run;

/* combine training and validation */
data mycas.develop_final_norm;
	set mycas.develop_train_norm mycas.develop_valid_norm;
run;

/* midrange scale inputs for LASSO */
proc stdize data=mycas.develop_train 
	out=mycas.develop_train_mid method=midrange outstat=train_mid;
	var &varreducelist;
run;

/* midrange scale validation inputs using training adjustments */
proc stdize data=mycas.develop_valid 
	out=mycas.develop_valid_mid method=in(train_mid);
	var &varreducelist;
run;

/* combine training and validation */
data mycas.develop_final_mid;
	set mycas.develop_train_mid mycas.develop_valid_mid;
run;

 

Next, I used the SAS Viya procedure PROC LOGSELECT with the METHOD=LASSO option to perform LASSO selection.  The CHOOSE=VALIDATE option picks the best model from the LASSO sequence as the one with the lowest validation average squared error.  A macro saves some typing by applying LASSO selection to the three differently scaled data sets as well as to the unscaled data.

 

Note that I used SAS Viya 2022.09 LTS for this demonstration.  Starting in SAS Viya 2023.03 LTS, PROC LOGSELECT by default internally centers and scales the data used in LASSO selection. To turn this off, add the CENTERLASSO=FALSE option to the MODEL statement.
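On those later releases, the MODEL statement in the macro below would become:

model Ins(event='1')= res &varreducelist/ link=logit centerlasso=false;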

 

Here is the code:

 

%macro scaledLASSO (data=);
proc logselect data=&data association;
	partition role=_PartInd_ (validate='0' train='1');
	class res;
	model Ins(event='1')= res &varreducelist/ link=logit;
	selection method=lasso(choose=validate);
run;
%mend scaledLASSO;

%scaledLASSO (data=mycas.develop_final_std);
%scaledLASSO (data=mycas.develop_final_norm);
%scaledLASSO (data=mycas.develop_final_mid);
%scaledLASSO (data=mycas.develop_final);

 

Here is a summary of fit statistics from the different variable scaling approaches:

 

Variable scaling               standardization   normalization   midrange     none
Inputs in final model                       31              24         23        4
Validation misclassification            0.2714          0.2832     0.2838   0.3413
Validation ASE                          0.1789          0.1845     0.1848   0.2194
Validation AUC                          0.7781          0.7607     0.7596   0.7013

 

With these data, standardization performed best, followed by normalization, with midrange scaling only negligibly worse.  Is standardization generally a better scaling approach for LASSO?  No; the best scaling for prediction is an empirical question and often must be determined through trial and error.  I did not have a priori knowledge that standardization would perform best. While the differences in misclassification among the scaling approaches were small, all of the variable scaling approaches performed considerably better than the unscaled data.  Scaling is often important, and the best method depends on your research goals and your data.

 

For more information on many of the topics covered in this post such as variable scaling, the bias-variance trade-off, LASSO selection, and neural network models, consider taking the SAS class “Statistics You Need to Know for Machine Learning”.

 

Further reading:

 

Big Ideas in Machine Learning Modeling: The Bias-Variance Trade-Off

 

How to Avoid Data Leakage When Imputing for Predictive Modeling

 

Course: Statistics You Need to Know for Machine Learning

 

 

 

Find more articles from SAS Global Enablement and Learning here.
