We’re smarter together. Learn from this collection of community knowledge and add your expertise.

To standardize data or not to standardize data - that is the question

by SAS Employee dishaw on ‎05-25-2017 03:20 PM - edited on ‎05-25-2017 04:11 PM by Community Manager (2,510 Views)

Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune using variables measured at different scales

Or to take arms against multi-scaled data with a vast array of available SAS Procs, Data step approaches, and CAS Actions

And by doing so, end unplanned analysis bias.

To standardize, to normalize, to leave unchanged - How do I know?

 

Just kidding. We all know William Shakespeare's Hamlet was not worried about preparing data for analysis. But do you? Have you ever wondered if you should standardize or normalize multi-scaled data prior to running analyses? I have, and I hope I'm not alone.

 

The following blog shares my findings, recommendations and provides code snippets to address multi-scaled data using either SAS 9.4 or SAS Viya environments.

 

Just a quick refresher before we begin…

Standardizing means to rescale your data to have a mean of zero and a standard deviation of one. A standardized variable is sometimes called a z-score or a standard score. Here’s the equation to standardize your data:

where x is the original value, μ is the variable’s mean, and σ is the variable’s standard deviation. The absolute value of z represents the distance between the raw score and the population mean in units of the standard deviation. z is negative when the original value is below the mean, positive when above. z is between -3 and 3 for most data.

 

Normalizing is another rescaling method with many meanings in statistics and statistical applications. Most commonly, normalizing rescales numeric data between zero and 1 using the following equation:

where xmin is the variable’s minimum value, and xmax is the variable’s maximum value.

 

Why would I do this? Does it really matter? What is the impact of not standardizing?

Speaking with members of our R&D team and researching online articles, I quickly realized that standardizing multivariate data is important, particularly when variable scales differ significantly.

 

Model stability and parameter estimate precision are influenced during multivariate analysis when multi-scaled variables are used. For example, in boundary detection, a variable that ranges between 0 and 100 will outweigh a variable that ranges between 0 and 1. Using variables without standardization can give variables with larger ranges greater importance in the analysis. Transforming the data to comparable scales can prevent this problem.

 

Standardizing continuous predictor variables in neural network is extremely important.

 

When performing regression analysis, standardizing multi-scale variables can help reduce multicollinearity issues for models containing interaction terms.

 

Standardizing your data prior to cluster analysis is also extremely critical. Clustering is an unsupervised learning technique that classifies observations into similar groups or clusters. A commonly used measure of similarity is Euclidean distance. The Euclidean distance is calculated by taking the square root of the sum of the squared differences between observations. This distance can be greatly affected by differences in scale among the variables. Generally variables with large variances have a larger effect on this measure than variables with small variances. For this reason, standardizing multi-scaled variables is advised prior to performing clustering.

 

Standardizing or normalizing data is also important in principal component analysis (PCA) since it projects your original data onto orthogonal directions which maximize the variance.

 

However, tree-based analyses are not sensitive to outliers and do not require variable transformations. As a result, standardization of multi-scaled data is not necessary for Decision Trees, Random Forest and/or Gradient Boosting algorithms.

 

Here is a list of Standardization Recommendations by Modeling Type:

 

Still don’t believe in the value of standardizing? Need to see an example?

5_bunny.png

Imagine you are tasked with analyzing seasonal donations. Let’s say you have a dataset tracking the number of eggs the Easter Bunny delivered, the number of gifts Santa Claus delivered, the number of shillings Mr. Scrooge donated, and my personal charitable donations.

 

How do you know when an outlier occurs? Does it make sense to compare these variables ‘as is’ in a multivariate analysis? These variables clearly have different ranges. (Well hopefully these variables have different scales. I’d like to think my charitable gift giving is more generous than Mr. Scrooge’s.)

 

The following SAS Report Viewer displays explorations of mock seasonal gifts and donations created in SAS Visual Analytics 8.1. Notice the multi-scaled range of the contributors. Notice the differences in cluster selections for non-standardized and standardized data. Notice how variables with large variances tend to control cluster selection versus variables with small variances.

 

 

The following plots demonstrate the importance of standardizing your data prior to running principal components analysis and variable selection analysis. The first plot below shows the amount of total variance explained in the different principal components for this raw (non-standardized) data.

 

Submitted code: 
title 'Raw Non-Standardized Data - Variance Explained';
proc pca data=casuser.rawdata cov plots=(scree);
        var var1 var2 var3 var4;
run;

 

In this example, running variable selection on the raw non-standardized dataset resulted in only one component to explain all the variance.

Submitted code:
proc varreduce data=casuser.rawdata matrix=cov;
        reduce unsupervised var1 var2 var3 var4 / varianceexplained=0.9;
        ods output selectionsummary=work._VarSelection_summary_;
run;
data work.out_iter (keep=Iteration VarExp Base Increment Parameter);
         set work._VarSelection_summary_;
         Increment=dif(VarExp);
         if Increment=. then
                 Increment=0;
         Base=VarExp - Increment;
run;
proc transpose data=work.out_iter out=work.out_iter_trans;
         by Iteration VarExp Parameter;
run;
proc sort data=work.out_iter_trans;
         label _NAME_='Group';
         by _NAME_;
run;
title 'Raw Non-Standardized Data - Variance Explained by Iteration';
proc sgplot data=work.out_iter_trans;
         yaxis label='Variance Explained';
         vbar Iteration / response=COL1 group=_NAME_;
run;

 

However, when you standardize the data, it is clear that other variables contribute to the data variance. The reason for this is because PCA seeks to maximize the variance of each component.

Submitted code: 
title 'Standardized Data - Variance Explained';
proc pca data=casuser.stddata cov plots=(scree);
        var STD_var1 STD_var2 STD_var3 STD_var4;
run;

 

As a result, running variable selection on standardized data results in several components being used to explain the data variation.

Submitted code:
proc varreduce data=casuser.stddata matrix=cov;
        reduce unsupervised STD_var1 STD_var2 STD_var3 STD_var4 / varianceexplained=0.9;
        ods output selectionsummary=work._VarSelection_summary_;
run;
data work.out_iter (keep=Iteration VarExp Base Increment Parameter);
         set work._VarSelection_summary_;
         Increment=dif(VarExp);
         if Increment=. then
                 Increment=0;
         Base=VarExp - Increment;
run;
proc transpose data=work.out_iter out=work.out_iter_trans;
         by Iteration VarExp Parameter;
run;
proc sort data=work.out_iter_trans;
         label _NAME_='Group';
         by _NAME_;
run;
title 'Standardized Data - Variance Explained by Iteration';
proc sgplot data=work.out_iter_trans;
         yaxis label='Variance Explained';
         vbar Iteration / response=COL1 group=_NAME_;
run;

 

Hopefully it is now clear that standardizing multi-scaled data is extremely important when performing multivariate analysis.

 

How to create a standardized dataset in SAS?

Easy! There are many ways to create new datasets with standardized variables in SAS including running data step code, macros, sql queries, SAS 9.4 procedures, and CAS actions. Here are my recommendations to create a new standardized dataset in SAS 9.4 and SAS Viya:

 

SAS 9.4: Use PROC STANDARD and PROC STDIZE

The STANDARD procedure standardizes variables in a SAS data set to a given mean and standard deviation, and it creates a new SAS data set containing the standardized values.

 

PROC STANDARD data=rawdata mean=0 std=1 out=stdout;

   var var1-var4;

run;

 

The STDIZE procedure in SAS/STAT software standardizes one or more numeric variables in a SAS data set by subtracting a location measure and dividing by a scale measure. A variety of location and scale measures are provided, including estimates that are resistant to outliers and clustering. Some of the well-known standardization methods such as mean, median, standard deviation, range, Huber’s estimate, Tukey’s biweight estimate, and Andrew’s wave estimate are available in the STDIZE procedure.

 

PROC STDIZE data=rawdata out=stdize_out method=RANGE;

     var var1-var4;

run;

 

SAS Viya 3.2: Use dataPreprocess CAS Action

As you may be aware, R&D is actively developing new SAS Viya procedures. Currently, PROC STANDARD and PROC STDIZE are not CAS enabled and thus cannot run in the SAS Viya environment.

 

This is also true for PROC MEANS and PROC UNIVARIATE. But have no fear…there are tons of CAS actions you can easily call from PROC CAS that are super-fast multi-threading substitutes. In fact, my favorite replacement for PROC MEANS is summary in the Simple Analytics Action Set.

 

When it comes to transforming your data for predictive modeling, please consider the transform action in the Data Preprocess Action Set. It is a great ‘go-to’ CAS action for all types of data transformations.

 

For standardization, transform defines standardization as:

where location can be any of the following: mean, median, winsorized mean, trimmed mean, etc. and scale can be: std, mad, gini scale, Tukey-biweight, etc.

 

Here is a code snippet to standardize data in SAS Viya using the transform CAS action:

PROC CAS;
      session mycas;
      dataPreprocess.transform /
      table = {name = "blog_data" where="partition=0"}
      requestPackages = {{inputs = ${var1, var2, var3, var4}
                          function = {method = "standardize"
                          arguments = {location = "mean"
                                       scale = "std"}
                                      }
                         }}
      casout = {name = "blog_std", replace=True}
      outVarsNameGlobalPrefix = "STD"
      copyVars={"id","partition","var1","var2","var3","var4"}
      code={comment=True}
      ;
      run;
quit;

IMPORTANT REMINDER: If you create a separate dataset to standardize your model training data, you should keep the training data standardization parameter information (i.e. standardization score code). This information should be used to standardize your new data to be scored in order to maintain model validity.

 

 

To standardize your data, do you need to create a new standardized dataset in SAS?

Not always. Many SAS 9.4 procedures and SAS Viya procedures/CAS actions include standardization options.

 

For example, the VARREDUCE procedure standardizes input data using correlation matrix by default (MATRIX = CORR) when performing both supervised and unsupervised variable selection in SAS Viya. If the user specifies MATRIX = SSCP, the SSCP is calculated on the original data scale (i.e. no standardization done internally by the proc).

 

Recommendations from R&D are reflected in the procedure defaults. For example, neural networks use midrange to standardize data (i.e. the midrange value is set to 0 and the variables have a minimum of –1 and a maximum of 1). Whereas Support Vector Machine normalizes data using the scale parameter (i.e. input data is rescaled to a value between zero and 1).

 

 

To summarize

Standardizing multi-scale data prior to performing multivariate analysis is highly recommended and simple to do. Just remember to incorporate your training data standardization information into your modeling score code.

 

 

Acknowledgements:

I would like to say a special thank you to Radhikha Myneni for providing a technical review of this blog. Also, thank you to Biruk Gebremariam, Ruiwen Zhang, and Taiping He for confirming details about SAS Viya CAS actions and default procedure configurations.

Contributors
Your turn
Sign In!

Want to write an article? Sign in with your profile.


Looking for the Ask the Expert series? Find it in its new home: communities.sas.com/askexpert.