3 weeks ago
Dealing with missing data is a challenge in statistical modeling. Whether the data come from a designed experiment or from business operations, most data sets have missing values. Missing values are problematic because most analyses discard any row of your data in which a variable used in the model has a missing value. Even with a small number of missing values, this can lead to enormous data loss. Many analysts prefer to impute, that is, to fill in missing values with (hopefully) reasonable proxies. In some cases, researchers simply impute fixed values such as the mean or median of the nonmissing values for continuous variables. In other situations, more sophisticated techniques such as cluster imputation, regression imputation, or multiple imputation are used. In this post, I will describe the situations in which different imputation methods are appropriate and demonstrate simple and quick fixed-value imputation techniques that may be appropriate for predictive modelers. In a follow-up post, I will demonstrate imputation techniques geared toward statistical inference practitioners.
The default approach to handling missing values: complete case analysis
Most statistical and machine learning models use complete case analysis (CCA, also called listwise deletion), meaning only rows that have complete data for the predictors and the response are used in the analysis. For many situations, this will result in an unacceptable loss of data. With even a small percentage of missing data, say 1% of the individual values, CCA can sometimes result in 2/3 of the rows being dropped from analyses. Additionally, CCA may bias the results depending on why the data are missing. So why is this the default approach for statistical modeling? When the missing values are missing completely at random (MCAR, described below), complete case analysis yields unbiased parameter estimates. But even with data being MCAR (a strong assumption about the data), CCA will reduce sample size and therefore increase standard errors, increase the width of confidence and prediction intervals, and reduce statistical power. Under other types of missingness, CCA will typically yield biased parameter estimates.
Imputing missing values
Imputation is often a better alternative to complete case analysis for handling missing data. Unless there is a very small amount of missing data, imputation is generally necessary, particularly if the goal is scoring new data. For predictive modeling, even if the model development data is nearly complete, most models cannot make predictions if the data to be scored is missing any values, so imputation will likely be necessary. For explanatory modeling, imputation can be a better alternative than CCA when data are missing at random (described below), a situation in which CCA can produce biased parameter estimates and inflated standard errors. Despite appropriate imputation methods being beneficial to explanatory modelers, several authors have noted that published research analyses don’t always acknowledge the presence of missing data or describe how it was handled, suggesting that there is widespread uncertainty about how to impute data. This is probably because imputation for inference typically requires more sophisticated techniques such as regression imputation or multiple imputation.
When is imputation unnecessary?
Some models do not need imputation because they do not use CCA. Decision trees and tree-based models such as random forests and gradient boosting have built-in ways of handling missing data. Some analysts feel that decision trees are not as interpretable as regression models, but their built-in methods for handling missing values can make them a good alternative. The SAS Viya procedure PROC TREESPLIT can be used for fitting decision tree models.
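As a rough sketch (the CAS table, target, and input names below are hypothetical placeholders), fitting a tree directly to data that contain missing values might look like this:
/* Hypothetical example: fit a decision tree without imputing first; */
/* PROC TREESPLIT expects an in-memory (CAS) table such as mycas.my_data */
proc treesplit data=mycas.my_data;
class purchased region; /* categorical target and input */
model purchased = region income age tenure; /* inputs may contain missing values */
prune costcomplexity; /* cost-complexity pruning */
run;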
Some models can estimate parameters through maximum likelihood without discarding incomplete rows. They use all the available data to find the parameters that are most likely to have generated the observed data, without actually imputing the missing values. This involves jointly maximizing the likelihoods based on complete rows and the likelihoods based on incomplete rows. PROC MIXED and PROC GLIMMIX can produce these kinds of maximum likelihood estimates when data have missing response values but not missing predictors. For data with missing predictors, PROC MI can use an expectation-maximization algorithm to produce maximum likelihood estimates without imputing or discarding data. PROC MI will be discussed in a follow-up post on multiple imputation.
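For example, a likelihood-based mixed model for repeated measures keeps subjects who are missing the response at some visits; all data set and variable names below are hypothetical:
/* Hypothetical repeated-measures analysis: subjects with missing responses at */
/* some visits still contribute their observed visits to the likelihood */
proc mixed data=work.trial_long method=reml;
class subject visit treatment;
model response = treatment visit treatment*visit;
repeated visit / subject=subject type=un;
run;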
When data are missing not at random (described below), bias in analyses based on imputed data (including by multiple imputation) may be larger than biases introduced by CCA.
Appropriate imputation depends on the analysis goal and the missing data mechanism
Is your work focused on understanding the risk factors and their relative importance for Type-2 diabetes? Or is your work focused on predicting who is likely to donate to your charity when they receive a solicitation? The best approach for imputing missing values will differ between these situations. It will depend on your research goal as well as the missing data mechanisms.
So, I’m going to dichotomize the goals of statistical modeling into prediction and statistical inference (also called “explanatory modeling”). Business analysts are typically interested in getting the most accurate predictions possible from their models. Examples might include predicting which transactions are fraudulent, or which customers are likely to churn and cancel services with a particular business. Parameter estimates and p-values are of secondary importance, if they are of interest at all.
Statistical inference practitioners are typically focused on calculating unbiased parameter estimates to explain the relationships among variables. These explanatory modelers are often interested in constructing confidence intervals around these parameter estimates and carrying out significance tests. While prediction may be an eventual goal, that comes much later than parameter estimation and hypothesis testing and is often in service of establishing causal relationships.
Why does this distinction matter for imputation? The purposes of imputation are:
To reduce the amount of data lost due to missing values and CCA.
To reduce the bias in parameter estimates which can occur with certain kinds of missing data.
So, it’s worth considering if the focus of your work is unbiased parameter estimation or pure prediction. Predictive modelers will likely be most concerned with data loss. Explanatory modelers will probably be most concerned with unbiased parameter estimates, but data loss and therefore statistical power may also be an important concern.
There are several common imputation techniques used for these goals. They include:
Fixed-value imputation, such as imputing the mean, median, or mode of the nonmissing values.
Cluster imputation, a type of fixed-value imputation, in which the cluster-specific fixed-value (e.g., mean) is imputed.
Regression imputation which uses regression to predict the missing values of one predictor based on the values of other predictors. The imputed values for a variable are not fixed and could be unique for each observation. This could also be performed using decision trees or other machine learning models.
Hot-deck imputation (and similar approaches such as predictive mean matching) in which similar observations are found, and imputation is based on the nonmissing values of these similar observations.
Multiple imputation in which multiple complete data sets are created with different values imputed in each set. The sets are analyzed in combination using specialized techniques.
The goal of modeling (prediction or inference) along with the type of missingness will determine which of these imputation approaches is a good choice for your work. So next, I’ll explain the types of missing values. Warning: the terminology is non-intuitive and, in my opinion, poorly chosen, but unfortunately it is well established. Also, there is usually no good way to know with certainty which of these missingness mechanisms applies to your variables with missing values. You need to know your data well to make good guesses. Keep in mind that the type of missingness for variable-1 in your data may be different from the type of missingness for variable-2, so the missing data mechanisms should be evaluated separately for each variable.
Missing Completely at Random (MCAR). This is the easiest missingness to deal with. MCAR means the chance a value is missing from your data set is not related to its (unknowable) value or to any other variable inside or outside of your data set. The loss of the data is 100% random and losing these data will therefore not bias any parameters (but will reduce sample size and therefore power). For many kinds of data, MCAR may be an unrealistic assumption. Several traditional missing data techniques such as CCA are valid only if the MCAR assumption is met. If the observed data are a random subset of the full data, CCA gives essentially the same results as the full data set would have.
Examples of MCAR
A researcher drops petri dishes from a randomized experiment. Since the dishes were randomized to trays, a purely random sample of experimental units was lost.
A patient in a clinical trial forgot to report symptoms and it has nothing to do with their health status or drug treatment effects.
A manufacturing sensor fails to record data because of a power outage in the building, and the missing values are not related to the production process.
Missing at Random (MAR). This is neither the easiest nor the hardest situation to deal with. MAR means there is a relationship between the missingness and other variables in your data set (but not to unobserved data). In other words, the missingness can be predicted from other variables in your data. In MAR, there is no relationship between the missing values themselves and the chance they are missing (after adjusting for the observed variables). This is often considered more plausible for many kinds of missing data than MCAR. Some researchers have suggested using a t-test comparing the missing value group to the non-missing value group for other continuous predictors as a test for MAR. Logistic regression with missing status of a variable as the target can also be used to detect predictor-missingness associations that suggest data are MAR.
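Here is a minimal sketch of that kind of check, using hypothetical data set and variable names: create a missingness indicator for one variable and model it from the other observed predictors.
/* Hypothetical check: is the missingness of income related to other observed variables? */
data work.check_missing;
set work.survey;
miss_income = missing(income); /* 1 if income is missing, 0 otherwise */
run;
proc logistic data=work.check_missing;
model miss_income(event='1') = age education_years job_years;
run;
Predictors that are significantly associated with the missingness indicator are consistent with MAR, although checks like this can never rule out MNAR.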
Since MAR is easier to deal with than MNAR (next section), it can be beneficial to have other variables that can help predict the missingness. These variables can be used in various imputation models such as regression imputation or multiple imputation. These imputation techniques (as well as the maximum likelihood and EM algorithm approaches) can reduce parameter estimation bias when data are MAR. These are all good approaches for explanatory modelers with MAR data.
Examples of MAR
In a customer survey, people with the highest reported education level are more likely to not report their income and the missingness of income data is not related to the value of income.
In a clinical trial, participants who have longer commute times to a clinic miss scheduled lab tests more frequently, but the missing lab results are unrelated to their actual health condition.
In an employee satisfaction report, employees in certain departments tend to skip questions about job satisfaction more often, but the missingness is unrelated to their actual satisfaction level.
Women were less likely to report their weight in a survey than men, and you have complete information on gender of the survey respondents.
Missing Not at Random (MNAR). This is the most complicated situation to deal with. In MNAR, the chance a value is missing depends on that value itself or on other variables that are not contained in your data. In other words, there is a definite pattern to the missing values, but that pattern is unrelated to any observed data. This can heavily bias your parameter estimates and standard errors. Imputing for MNAR generally requires using advanced methods that depend heavily on assumptions about the unknown information, and there is no way to verify that these assumptions are correct. Because of this, several analyses can be done with different assumptions to show how sensitive the results are to these assumptions. When data are MNAR, researchers may just need to get new data or more data.
Examples of MNAR
Respondents with highest incomes are least likely to report their incomes and the missingness is uncorrelated with other observed data.
In a clinical trial, patients with severe side effects from a drug drop out of a study, leaving missing data related to the severity of their condition or treatment response.
In an employee satisfaction survey, employees dissatisfied with their new reporting requirements are less likely to respond to the survey. Their missing responses are directly related to their dissatisfaction.
Fixed-value imputation for predictive modelers
For predictive modelers, fixed-value imputation (e.g., the mean) is often an acceptable imputation approach. When the data are MCAR, this will bias standard errors downward but will not bias parameters such as the sample mean. Regression coefficients may be biased downward since the response variable was not used in the imputation (this can happen because the imputed mean X can be paired with any value of Y, weakening the X-Y relationship). When data are MAR, mean imputation can lead to biased parameter estimates and can weaken or distort the relationships between variables. It also inflates the confidence in parameter estimates because it doesn't account for the uncertainty introduced by imputing missing data. These problems are worse when data are MNAR.
But when the goal is pure prediction, biased parameters are not as extreme a problem as they are for explanatory modeling. In fact, predictive modelers often employ shrinkage methods such as LASSO and elastic net regression that intentionally bias parameters to reduce prediction error. With that in mind, the benefits of fixed-value imputation may outweigh the costs.
Fixed-value imputation is very fast, freeing up the analyst’s time for other parts of modeling or for developing additional models. This is not a statistical but a practical advantage. For some business needs, a good model produced today is better than an excellent model produced next month. And while more sophisticated approaches such as multiple imputation may reduce bias in parameters (when data are MAR), these approaches may drastically increase the computational power needed for the large data sets typically used in predictive modeling.
Fixed-value imputation can be paired with missing indicator variables. Each of these constructed binary variables corresponds to a predictor that has missing values. When X1 is missing a value, the missing indicator for X1 takes a value of 1; otherwise its value is zero. These missing indicators can potentially capture associations between the missingness of the predictors and the target. Missing indicators are created automatically by the SAS Viya regression procedures GENSELECT, LOGSELECT, PHSELECT, QTRSELECT, and REGSELECT when the INFORMATIVE keyword is added to the MODEL statement. INFORMATIVE also automatically imputes the mean for missing continuous variables.
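As a sketch of what that might look like with PROC LOGSELECT (the CAS table, target, and input names are placeholders):
/* INFORMATIVE creates missing indicators for the inputs and mean-imputes */
/* missing continuous inputs while fitting the model */
proc logselect data=mycas.my_data;
model ins(event='1') = var1 var2 var3 / informative;
run;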
This post has focused on imputation for continuous predictors. For nominal predictors with missing values, a common approach is to treat missing as a valid analysis level. So, a 2-level predictor (e.g., gender) with missing values would be treated as a three-level nominal variable.
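One simple way to create that extra level is a DATA step recode; the variable name and length below are hypothetical:
/* Treat missing values of a character predictor as their own analysis level */
data work.my_data_cat;
length gender $ 8; /* make room for the new level */
set work.my_data;
if missing(gender) then gender = "Missing";
run;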
How to do fixed-value imputation
There are several ways to do fixed-value imputation in SAS. One approach is to use the SAS 9 procedure PROC STDIZE. The code below shows how to impute the median for missing values of variables var1, var2, and var3. Other options besides median imputation include imputing the mean, midrange, minimum, or values referenced from a separate data set. The REPONLY option causes the procedure to only replace missing values with the median and not to standardize the data.
proc stdize data=work.my_data method=median reponly out= work.my_data_imputed;
var var1 var2 var3;
run;
For imputing in-memory data in SAS Viya, PROC VARIMPUTE is a good choice. Below is some example code to do the trick. The CTECH option is the continuous variable imputation technique. It can also be used to impute the mean, a random value between the minimum and maximum, or a specific value.
proc varimpute data=mycas.my_data;
input var1 var2 var3 / ctech=median;
output out=mycas.my_data_imputed copyvars=(_all_);
run;
Imputed variables in the output data mycas.my_data_imputed will have the prefix IM_ (e.g., IM_var1). I especially like that PROC VARIMPUTE can impute random values uniformly distributed between the minimum and maximum. If I’m missing lots of values for one variable, this approach prevents a big spike at the median, which can be problematic for some models.
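A quick sketch of that random-value imputation, using the same placeholder table and variables as above:
/* Impute a uniform random value between each variable's observed minimum and maximum */
proc varimpute data=mycas.my_data;
input var1 var2 var3 / ctech=random;
output out=mycas.my_data_imputed_rand copyvars=(_all_);
run;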
Cluster imputation
Cluster imputation imputes the cluster-specific fixed-value, such as a mean, instead of the overall mean. These imputed values may be less biased than the unconditional fixed-value. While there are many approaches to coming up with clusters (e.g., SAS 9: PROC CLUSTER, SAS Viya: PROC KCLUS), here is a simple way of “clustering” using PROC RANK. Let’s say you wanted to find clusters of individuals based on age and years on the job. You could use PROC RANK to find three age-groups, and three years-on-the-job-groups for a total of nine categories or clusters:
proc rank data= work.my_data out= work.my_data_rank groups=3;
var age yearsonjob;
ranks age_group yearsonjob_group; *makes 9 combinations of these groups;
run;
Then you could use either of the two fixed-value imputation methods along with a BY statement to impute within each cluster. Because BY-group processing requires sorted data, sort by the group variables first:
proc sort data=work.my_data_rank;
by age_group yearsonjob_group;
run;
proc stdize data=work.my_data_rank method=median reponly out=work.my_data_imputed_clus;
by age_group yearsonjob_group;
var var1 var2 var3;
run;
In a follow-up post, I’ll demonstrate regression imputation and multiple imputation approaches. For more information on the maximum likelihood estimation that I briefly mentioned, see the SAS Global Forum 2012 paper “Handling Missing Data by Maximum Likelihood” by Paul Allison. The book “Regression Modeling Strategies” by F. Harrell has a helpful chapter on analyses with missing data.
- Tarek Elnaccash
Find more articles from SAS Global Enablement and Learning here.
01-21-2025
02:43 PM
1 Like
Principal component analysis (PCA) is a widely used data summarization and dimension reduction technique for multivariate data. The principal components (PCs) are new variables constructed as weighted linear combinations of a set of original variables. These components, particularly the first few, can summarize the multivariate correlation structure of the original data, usually with far fewer components than the number of original variables. Researchers may be interested in determining whether the data structure summarized by PCs changes across groups or treatments. This can be assessed by comparing the direction of components as described by the variable loadings. In this post, I’ll demonstrate how to create bootstrap confidence intervals on PC loadings to test hypotheses about changes in data structure across groups.
Data summarization using PCA
Principal components can summarize the data structure and correlations among the many quantitative variables. The first component explains the most information in the original data set. It represents the direction in the data where most variation exists, i.e. it has the largest variance or eigenvalue of any possible linear combination of the original variables. Because the first few PCs tend to explain most of the variation, often the remaining PCs can be discarded with minimal loss of information. The assumption is that early components summarize meaningful information, while variation due to noise, experimental error, or measurement inaccuracies is contained in later components. For a brief introduction to PCA, please see my previous post How many principal components should I keep? Part 1: common approaches.
If we assume that the first PC summarizes the most meaningful correlation structure in the data, researchers may be interested in determining if the first component changes across groups of interest. Changes in a component can occur in its size or direction. The size of a component is measured by the magnitude of its eigenvalue. The direction of a component can be described by the vector of variable coefficients, also called loadings. This post focuses on assessing changes in directions of PC1, but the methodology could easily be adapted to assessing changes in eigenvalues as well.
There is no commonly used formal test for changes in PC direction, but one can be constructed by putting confidence intervals (CI) on the PC loadings. The idea here is that we want to distinguish meaningful differences in the direction of PC across groups from differences due to random chance. If PC1 loadings for a group A are outside the confidence interval for the PC1 loadings of group B, the pattern of covariance has significantly changed across these groups. These confidence intervals can be constructed through bootstrapping.
Bootstrap confidence intervals
Bootstrapping is a resampling technique often used to construct confidence intervals on statistics of interest. Unlike traditional confidence intervals, bootstrap CIs require no assumptions about the underlying distribution of the data, such as normality. Bootstrap data sets are replicate random samples drawn with replacement from the original data. Some observations will show up several times in a bootstrap data set, others not at all. Resampling is typically done many times, creating say 100-1000 replicate data sets, and the statistic of interest is calculated on each replicate. This produces an empirical distribution of the statistic, and a 95% confidence interval can be constructed by finding the 2.5th and 97.5th percentiles of this distribution.
To illustrate creating bootstrap CIs on eigenvector loadings, I used the SAS 9 procedure PROC PRINCOMP to analyze Fisher’s famous iris data set (sashelp.iris). These data contain 4 variables measuring floral morphology for 3 species of iris. The variables are sepal length, sepal width, petal length, and petal width. Fifty flowers of each of Iris setosa, Iris virginica, and Iris versicolor were measured for a total of n=150.
Research in iris evolution suggests that I. virginica and I. versicolor are more recently derived species compared to the more ancestral I. setosa. I thought it would be interesting to see if the structure of PC1 has changed in Virginica and Versicolor relative to Setosa in these data. To assess this, I computed bootstrap confidence intervals on the Setosa floral trait PC1 loadings and compared them with the directions of PC1 in the other 2 species. If the loadings of Virginica and Versicolor for PC1 are not within the 95% CI of Setosa, we can say that the structure of the floral trait correlations is significantly different between the species.
First, I subset the data to the Setosa species. Next, the SURVEYSELECT procedure was used to create 1000 bootstrap samples. The SURVEYSELECT documentation refers to bootstrapping as “unrestricted random sampling” and bootstrapping is achieved through the METHOD=URS option. While many of the options have obvious functions, the OUTHITS option is necessary for getting the correct sample size in each replicate (N=50). Omitting OUTHITS results in one row in the output data per unique row sampled from the input data (i.e., the duplicate rows sampled with replacement would be left out).
Eigenvectors were calculated for the 1000 bootstrap replicates by PROC PRINCOMP. PROC UNIVARIATE was used to find the lower and upper 95% confidence interval limits, which were estimated as the 2.5th and 97.5th percentiles of the distributions for each component. Here is the code:
proc princomp data=sashelp.iris;
by species;
var PetalLength PetalWidth SepalLength SepalWidth;
ods select Eigenvectors;
run;
data setosa versicolor virginica;
/*only used Setosa, but can create CI for any of these species */
set sashelp.iris;
if species="Setosa" then output Setosa;
else if species="Versicolor" then output Versicolor;
else output Virginica;
run;
%let reps=1000;
proc surveyselect data=setosa method=urs reps=&reps seed=12538 out=setosa_reps sampsize=50 outhits;
/*without OUTHITS option, you can get smaller samples than you want */
run;
ods select none;
proc princomp data=setosa_reps;
by replicate;
var PetalLength PetalWidth SepalLength SepalWidth;
ods output Eigenvectors=ev_setosa;
run;
proc sort data=ev_setosa out=ev_setosa_sort;
by variable;
run;
proc univariate data=ev_setosa_sort;
by variable;
var prin1 prin2 prin3 prin4;
output out=SetosaPCAbootCI pctlgroup=byvar pctlpts=2.5 97.5
pctlpre=prin1_ prin2_ prin3_ prin4_ pctlname=P025 P975;
run;
ods select all;
proc print data=SetosaPCAbootCI;
title 'Setosa bootstrap 95% CI on loadings';
var variable prin1_P025 prin1_P975;
run;
Below are the PC1 loadings for the three iris species. Keep in mind that 4 PCs are constructed for each species, but we are only viewing the first eigenvectors.
And here are the lower and upper confidence interval limits for the variable loadings of the Setosa PC1:
What we see above is that PC1 has a different direction in both Versicolor and Virginica than PC1 in Setosa. The loadings for petal length for Versicolor (0.53) and Virginica (0.55) are each outside the 95% confidence interval of Setosa (0.02, 0.51). Versicolor is significantly different from Setosa in all the loadings, while Virginica differs in 2 of the 4 loadings.
Another thing we can see from the Setosa confidence intervals is that none of the intervals contain zero. This means that each variable contributes significantly to the first principal component. If all confidence intervals contained zero, that is, if there were no significant loadings, then we could conclude that this component was entirely due to noise and the axis is not meaningful. If only one Setosa loading was significant, the PC axis would not be summarizing correlation structure among variables and again is not meaningful. In such a situation, the component would depend on only a single variable. This can happen when the PCA is based on covariances instead of correlations and the original variables have very different scales.
Use caution with this approach to generating confidence intervals for eigenvector loadings. Two situations can complicate interpretation of the confidence intervals. First, if two components (e.g., PC1 and PC2) explain a similar amount of variability, they can swap positions in bootstrap replicates. So, the confidence intervals on the loadings for PC1 would be based on some PC2s that have been mixed in among the replicates. I imagine this could be more likely if the PCs are summarizing more noise than signal, so make sure to address which PC axes are significant before testing for changes in direction. For an approach to determining which component axes are significant, please see my previous post How many principal components should I keep? Part 2: randomization-based significance tests.
Second, it is possible that a component can reverse direction in replicates, essentially having nearly the same loadings in magnitude but multiplied by negative one. Reversals would increase the chance that the loading CIs include the value zero, despite the eigenvector pointing in the same direction. If this is a concern, searching for negative correlations between bootstrap replicate components and the original PC could identify whether this problem exists.
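Here is one hedged way to flag reversed replicates. It assumes the bootstrap eigenvectors are in ev_setosa (created above) and that the original Setosa PC1 loadings have been saved to a table ev_setosa_orig with matching Variable values; that second table is an assumption, not part of the code shown earlier.
/* Dot product of each replicate's PC1 with the original PC1; */
/* a negative value flags a sign-reversed replicate */
proc sql;
create table work.pc1_reversals as
select r.replicate,
sum(r.prin1 * o.prin1) as dot_product,
calculated dot_product < 0 as reversed
from ev_setosa as r inner join ev_setosa_orig as o
on r.variable = o.variable
group by r.replicate;
quit;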
It is not difficult to assess changes in PCA loadings through SAS programming. The bootstrapped CIs created here are not limited to use in PCA; they can be used for nearly any statistic without making distributional assumptions. Hopefully this approach will be a useful tool to have in your statistical toolkit.
Further reading
My previous posts on determining how many PCs to retain for analyses:
How many principal components should I keep? Part 1: common approaches
How many principal components should I keep? Part 2: randomization-based significance tests
Rick Wicklin has a great series of SAS blogs on bootstrapping here: The essential guide to bootstrapping in SAS - The DO Loop
For more information on PCA and other multivariate techniques, try the SAS course Multivariate Statistics for Understanding Complex Data. You can access this course as part of a SAS Learning Subscription.
Find more articles from SAS Global Enablement and Learning here.
01-21-2025
02:18 PM
1 Like
Data scaling (transforming variables so they fit within a specific range or distribution) can improve the performance and interpretability of machine learning and statistical models. Several methods of scaling exist, including standardization, normalization, and midrange scaling. Many supervised and unsupervised machine learning models benefit greatly from variable scaling. Explanatory modelers can scale predictors to facilitate comparisons. For predictive modelers, the variable scaling methods that improve model fit and accuracy most are data dependent. In this post I'll describe several methods of scaling, and I’ll demonstrate empirical assessment of various scaling methods when applied to a logistic regression model using LASSO selection in SAS Viya.
Scaling for machine learning models
Many data sets contain variables that vary greatly in their units and range. For example, when trying to improve the yield of a chemical process, the temperature of the reaction may be in the hundreds while the pressure of that reaction might be in the range of 1 to 2 atmospheres. Several machine learning models are sensitive to differently scaled inputs. For these models, transforming variables to a similar scale is crucial for model fit and predictive accuracy. Differently scaled variables can result in slow model training, collinearity caused by scale differences, and imprecise parameter estimates. Often, the inputs with the greatest range of values will have the largest impact on the model, particularly with models based on distance calculations. Ensuring the inputs are on a similar scale will avoid many of these problems.
Scaling for explanatory modeling
Scaling can be beneficial for model interpretability. Researchers frequently want to compare the effects of different predictors on a response variable Y. How should this be done when the predictors often have different units? One approach is standardization, which converts predictor values into units of standard deviations from the mean. When using standardized variables in regression, the partial regression coefficients will describe the average change in Y for a one standard deviation (SD) change in the predictor. This is not only true for linear regression. In Poisson and Logistic regression on standardized predictors, coefficients represent the average effect on log counts or on the logit, respectively, per 1 SD change in X. If the research goal is to determine which regressor has the greatest impact on Y, one way to answer this question is to report the variable with the greatest magnitude of the standardized regression coefficient.
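For example, in ordinary least squares regression the STB option prints standardized coefficients without manually standardizing the data first (data set and variable names are hypothetical):
/* STB adds standardized coefficients to the parameter estimates table */
proc reg data=work.my_data;
model y = x1 x2 x3 / stb;
run;
quit;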
Additionally, standardized variables are centered on zero. In regression on unscaled variables, the intercept is often not interesting, either because it represents an unrealistic value (e.g., X=0 never occurs) or because it's outside the range of the data. When variables are standardized, the intercept refers to the mean value of Y when X equals its mean, instead of when X is zero. Now a regression, say of human height on weight, has an intercept that describes the average height for someone at the mean weight instead of the average height of someone who weighs zero pounds.
Explanatory modelers may scale measures of variability as well. The coefficient of variation (CV) is a mean-scaled standard deviation. This unitless measure of variability allows comparison of variability across variables as different as automobile accidents at city intersections and the number of petals on a typical sunflower.
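For example, PROC MEANS reports the coefficient of variation directly (the variable names are hypothetical):
/* CV = 100 * (standard deviation / mean), a unitless measure of variability */
proc means data=work.my_data mean std cv;
var intersection_accidents petal_count;
run;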
For the rest of this post, I’ll focus on scaling for machine learning and predictive models.
Methods of variable scaling
Three common methods of variable scaling are standardization, normalization, and midrange scaling. Standardization involves subtracting the mean from each value and dividing by the standard deviation of the variable. This is often most useful for variables that are normally distributed. Standardization typically compresses normally distributed variables to values ranging between −3 and +3. Extreme outliers can still be well beyond this range, so if models are very sensitive to outliers, other scaling methods may be a better choice.
Normalization can have several meanings; the most common in machine learning modeling is to scale variables to a minimum value of 0 and a maximum value of 1. This works well for data that do not follow a normal distribution. It results in all features being scaled to the same range, ensuring fair treatment of each predictor in machine learning models. This can be important for models that are sensitive to the magnitude of the input data. Note that despite the name, this type of scaling does not make a variable follow the normal (aka Gaussian) distribution. Normalization is done by subtracting the minimum value from each observation and dividing by the range (the maximum value minus the minimum value).
Midrange scaling is a variation of normalization. It scales variables to a minimum value of −1 and a maximum of +1. This scaling is often used for neural network models. For example, the SAS Viya NNET procedure defaults to midrange scaling of all inputs before fitting a neural network. Why is this preferred to normalization? A commonly used activation function in neural networks is the hyperbolic tangent, tanh. Tanh transforms the output from the hidden units (derived nonlinear functions of the predictors) to the range (−1, +1). Midrange scaling more closely aligns the inputs with the output of this function, which can facilitate the estimation of the neural network weights. Midrange scaling subtracts the midrange from each value, then divides by half the range. The midrange is half the sum of the maximum and minimum values:
scaled value = (x − midrange) / ((xmax − xmin) / 2), where midrange = (xmin + xmax) / 2
Which models are sensitive to variable scaling?
Principal component analysis – Principal component analysis is typically done on scaled variables (correlations) rather than unscaled data (covariances). If covariances are used and inputs have greatly different scales, principal components tend to be dominated by the variable with the largest range and are less useful for data summarization and data reduction.
Neural networks – Neural networks (NNs) have many parameters, so many local error minima likely exist. Putting the variables on the same scale helps speed the optimization and reduces the chance that a merely locally optimal set of weights is found. Using unscaled variables can cause the NN to take many more iterations to converge or sometimes to fail to converge entirely. In mathematical terms, the optimization process looks for gradients that indicate the steepest change in the direction of reduced error. NNs with unscaled data may be slower because the gradients are largest in the direction of inputs with larger scales than in the direction of inputs with smaller scales.
K-means clustering, k-nearest neighbors, and support vector machines – What these models have in common is that they involve calculating distances between data points to determine similarity. Distance-based algorithms are sensitive to variable units, so it is common to scale data so that all the predictors can contribute equally to the result.
Generalized linear models and mixed models – This includes logistic regression, Poisson regression, random coefficient models, and many others. What these all have in common is that they use maximum likelihood estimation. While scaling is not required for maximum likelihood estimation, this iterative process can sometimes proceed more smoothly when variables have a similar range. Particularly for complex models, scaling can sometimes fix convergence problems by making the optimization process more stable. The increased stability is because scale differences can cause the algorithm to take more chaotic steps on its way to the optimum.
Which models do not require variable scaling?
Tree-based models – Decision Trees, Forests, and Gradient Boosting don’t require variable scaling. These tree-based models use hierarchical splitting and are not based on distance calculations and do not use maximum likelihood estimation. In general, tree-based analyses are not sensitive to outliers and do not require variable transformations.
An example of variable scaling for logistic regression using LASSO selection
Next, I’ll demonstrate different methods of scaling on the outcome of a logistic regression model using LASSO selection. I analyzed a banking data set using various types of scaling with the goal of predicting customers likely to purchase an insurance product. These data are a modified version of the develop.sas7bdat data, available in SAS Viya For Learners. The target variable INS is a binary variable indicating 1 for purchase and 0 for no purchase and there were initially 25 inputs. These inputs were relatively uncorrelated due to using the SAS Viya procedure PROC VARREDUCE to perform unsupervised variable selection (used through the Variable Reduction task in SAS Studio) from a larger pool of starting predictors. I then used LASSO selection to find a logistic regression model with the lowest validation misclassification rate.
Least absolute shrinkage and selection operator (LASSO)
LASSO uses a version of least squares estimation that minimizes the error plus a function of the sum of the regression coefficients. This has the effect of shrinking the slopes, in some cases down to zero, which makes the corresponding predictor drop out of the model. Shrinkage also makes the model less sensitive to the training data and reduces the chance of overfitting. The idea is that LASSO adds a small amount of bias to predictions by shrinking coefficients relative to their unbiased estimates in exchange for removing a larger amount of prediction variance, due to the bias-variance trade-off. Optimizing this balance of bias and variance ideally increases the model’s overall predictive accuracy. For a general review of the bias-variance trade-off see my previous post Big Ideas in Machine Learning Modeling: The Bias-Variance Trade-Off. For more information on the bias-variance trade-off in relation to LASSO, see the introductory SAS data science course “Statistics You Need to Know for Machine Learning”.
Variable scaling using PROC STDIZE
I started by using PROC STDIZE to scale the data sets develop_train and develop_valid. The METHOD=STD, METHOD=RANGE, and METHOD=MIDRANGE options perform standardization, normalization, and midrange scaling, respectively. Note that scaling the training data with OUTSTAT=train_std and the validation data with METHOD=IN(train_std) results in the validation data being adjusted using the means and standard deviations from the training data, not its own. This prevents information leakage that would distort the validation assessment statistics. For information on information leakage in the context of imputing missing values, please see my previous post How to Avoid Data Leakage When Imputing for Predictive Modeling.
The training data and validation data were then combined because PROC LOGSELECT, like many other SAS Viya supervised modeling procedures, looks for training and validation data within the same data set.
Here is the code:
%let varreducelist=BIN_DDABal branch_swoe IM_CCBal DDA MMBal ILS MTG Checks NSF IM_LORes LOCBal CDBal log_IM_Income
IRA IM_CRScore Sav ATMAmt Moved CashBk IM_AcctAge SDB IM_POS SavBal InArea IM_Age IM_CC Teller IRABal DirDep DepAmt
IM_HMVal CD ATM MMCred LOC;
/* standardize training inputs for LASSO */
proc stdize data=mycas.develop_train
out= mycas.develop_train_std method=std outstat=train_std;
var &varreducelist;
run;
/* standardize validation inputs using training means and training std */
proc stdize data= mycas.develop_valid
out= mycas.develop_valid_std method=in(train_std);
var &varreducelist;
run;
/* combine training and validation */
data mycas.develop_final_std;
set mycas.develop_train_std mycas.develop_valid_std;
run;
/* normalize training inputs for LASSO */
proc stdize data= mycas.develop_train
out= mycas.develop_train_norm method=range outstat=train_norm;
var &varreducelist;
run;
/* normalize validation inputs using training adjustments */
proc stdize data= mycas.develop_valid
out= mycas.develop_valid_norm method=in(train_norm);
var &varreducelist;
run;
/* combine training and validation */
data mycas.develop_final_norm;
set mycas.develop_train_norm mycas.develop_valid_norm;
run;
/* midrange scale inputs for LASSO */
proc stdize data= mycas.develop_train
out= mycas.develop_train_mid method=midrange outstat=train_mid;
var &varreducelist;
run;
/* midrange scale validation inputs using training adjustments */
proc stdize data= mycas.develop_valid
out= mycas.develop_valid_mid method=in(train_mid);
var &varreducelist;
run;
/* combine training and validation */
data mycas.develop_final_mid;
set mycas.develop_train_mid mycas.develop_valid_mid;
run;
Next, I used the SAS Viya procedure PROC LOGSELECT with the METHOD=LASSO option to perform LASSO selection. The CHOOSE=VALIDATE option picks the best model from the sequence as the one that has the lowest validation average squared error. A macro was used to save some typing and apply LASSO selection to the three differently scaled data sets as well as the unscaled data.
Note that I used SAS Viya 2022.09 LTS for this demonstration. Starting in SAS Viya 2023.03 LTS, PROC LOGSELECT by default internally centers and scales the data used in LASSO selection. To turn this off, add the CENTERLASSO=FALSE option to the MODEL statement.
Here is the code:
%macro scaledLASSO (data=);
proc logselect data=&data association;
partition role=_PartInd_ (validate='0' train='1');
class res;
model Ins(event='1')= res &varreducelist/ link=logit;
selection method=lasso(choose=validate);
run;
%mend scaledLASSO;
%scaledLASSO (data=mycas.develop_final_std);
%scaledLASSO (data= mycas.develop_final_norm);
%scaledLASSO (data= mycas.develop_final_mid);
%scaledLASSO (data= mycas.develop_final);
Here is a summary of fit statistics from the different variable scaling approaches:
| Variable scaling | standardization | normalization | midrange | none |
|---|---|---|---|---|
| Inputs in final model | 31 | 24 | 23 | 4 |
| Validation misclassification | 0.2714 | 0.2832 | 0.2838 | 0.3413 |
| Validation ASE | 0.1789 | 0.1845 | 0.1848 | 0.2194 |
| Validation AUC | 0.7781 | 0.7607 | 0.7596 | 0.7013 |
With these data, standardization performed the best, followed by normalization, with midrange scaling only negligibly worse. Is standardization generally a better scaling approach for LASSO? No, the best scaling for prediction is an empirical question and will often need to be determined through trial and error. I did not have a priori knowledge that standardization would perform best. While the differences in misclassification among scaling approaches were small, all variable scaling approaches performed considerably better than the unscaled data. Scaling is often important, and the best method depends on your research goals and your data.
For more information on many of the topics covered in this post such as variable scaling, the bias-variance trade-off, LASSO selection, and neural network models, consider taking the SAS class “Statistics You Need to Know for Machine Learning”.
Further reading:
Big Ideas in Machine Learning Modeling: The Bias-Variance Trade-Off
How to Avoid Data Leakage When Imputing for Predictive Modeling
Course: Statistics You Need to Know for Machine Learning
Find more articles from SAS Global Enablement and Learning here.
01-15-2025
02:37 PM
1 Like
In my previous post What is collinearity and why does it matter?, I described collinearity, how to detect it, its consequences, and ways to reduce its impact. One method of reducing collinearity is to remove redundant predictors. In this blog, I’ll demonstrate two approaches for variable reduction which can remove collinearity prior to regression modeling.
Collinearity, also called multicollinearity, means strong linear associations among sets of predictors. Its presence can increase the variance of predictions and parameter estimates and can make the estimates unstable. The problems associated with collinearity can be avoided by removing redundant variables prior to regression modeling. Redundancy here means that the variables provide much of the same information and that they are typically (but not always) highly correlated.
Removing redundancy can be helpful even in the absence of severe collinearity problems, so the methods described below can be useful in other contexts. For example, predictive modelers often remove redundant predictors to avoid the curse of dimensionality, which refers to the problems that arise when the number of predictors is large relative to the number of observations. When the dimensionality of the data is large, even enormous data sets may be sparse and lead to unreliable modeling results.
The two approaches that I’ll demonstrate for removing redundancy and therefore collinearity involve variable clustering using the SAS 9 procedure PROC VARCLUS and variable reduction using the SAS Viya procedure PROC VARREDUCE.
For both demonstrations, I’ll use a banking data set (develop_training) with 61 numeric predictors and a binary target. The target, ins, indicates whether a customer purchased a variable annuity in response to a marketing campaign. A version of these data is available in SAS Viya for Learners. These data show strong collinearity (VIF>10). Variance inflation factors, calculated by PROC REG, show 5 predictors with VIF between 10 and 22 and 5 predictors with VIF between 45 and 63.
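For reference, here is a sketch of how those VIFs can be obtained; it assumes a macro variable &inputs holding the 61 predictor names, as used later in this post.
/* VIFs depend only on the predictors; the binary target ins is just a convenient response */
proc reg data=work.develop_training;
model ins = &inputs / vif tol;
run;
quit;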
Reducing redundancy through variable clustering (SAS 9)
The first approach involves finding clusters of correlated variables using PROC VARCLUS. Each cluster consists of variables that have high correlations with members of their cluster but low correlations with members of other clusters. Once clusters are formed, the analyst can manually throw out all but one predictor of each cluster. So, 100 predictors grouped into 37 variable clusters would result in reduction to 37 predictors for modeling.
Let’s walk through the code, which is modified from a demonstration in the SAS class Predictive Modeling using Logistic Regression. The procedure starts with all the variables in a single cluster. It then performs principal component analysis (PCA), and if the second eigenvalue is greater than a threshold, the cluster is split into two clusters of variables. This is repeated until the second eigenvalue for all clusters is below the user-chosen threshold.
What is the logic behind using the second eigenvalue here? Here is an illustration to help understand. On the left is pictured correlated data showing heights and weights for a group of people, while on the right is relatively uncorrelated data showing heights and incomes of the same people. For each data set, PCA finds PC1, a new variable that is a linear transformation of the two variables that has the greatest variance. A second new variable, PC2, is created so that it explains the second greatest proportion of variation in the original variables and is perpendicular to PC1. The eigenvalues (variances) of PC1 and PC2 are λ1 and λ2, respectively.
In the left picture, λ2 is small, which indicates height and weight are correlated and thus belong in the same cluster of variables. In the right picture, λ2 is large, indicating there are two (or more) relatively uncorrelated variables. These will get split into two variable clusters and the process repeats until all the second eigenvalues are below the chosen threshold. Each resulting cluster of variables is further split whenever the maximum second eigenvalue is above the chosen cutoff.
The default maximum second eigenvalue for PROC VARCLUS is 1. Several analysts believe that 1 reduces the number of predictors too much and prefer starting with 0.7. Smaller values will result in more variable clusters and thus less reduction in the number of predictors. I’ll show the effect of changing this setting later in this post.
PROC VARCLUS produces a lot of output that isn’t currently needed. So ODS SELECT turns off the printed report and ODS OUTPUT saves the relevant output to SAS data sets. The HI option performs hierarchical clustering. Basically, this prevents variables from switching clusters during splitting. The summary table is then printed to see how many clusters were created during the final splitting.
ods select none;
ods output clusterquality=work.summary /* last row has final # of clusters (=37) */
rsquare=work.clusters; /* only print the 37 clusters solution */
proc varclus data=work.develop_training
maxeigen=.7 hi; /* start with maxeigen=.7, can change later */
var &inputs;
run;
ods select all;
title "Variation Explained by Clusters";
proc print data=work.summary label;
run;
Here is a partial printout of the summary table:
We can see that the variables were split into 37 clusters, with 92% of the variability explained by these clusters. This will result in a reduction from the original 61 predictors to 37.
What if 37 is more predictors than desired? You could go back and change the maximum second eigenvalue to a larger number. For example, if you set MAXEIGEN=1, splitting would continue until there were 19 clusters. You can see this in row 19 of the table, where the maximum second eigenvalue first decreases below 1, when it takes the value 0.999772. This corresponds to 19 variable clusters.
Now that we know that there are 37 clusters, we can look at the cluster memberships and pick one predictor from each to keep. Printing out work.clusters where NumberOfClusters = 37 will show cluster membership for the final cluster solution:
proc print data=work.clusters noobs label split='*';
where NumberOfClusters=37;
var Cluster Variable RSquareRatio VariableLabel;
label RSquareRatio="1 - RSquare*Ratio";
run;
What criterion should be used for picking one predictor per cluster to keep? Here are several suggestions:
Use subject matter expertise to keep the variable that is theoretically meaningful or known to be an important predictor.
Keep the predictor with the strongest association with the target. The right tool for assessing the association depends on the type of predictor and target. When both are continuous, use Pearson correlation. For two categorical variables, you can use Cramer’s V or odds ratios (only for 2-level variables). For a binary target and a continuous predictor, you can use biserial correlations. See this note for how to calculate biserial correlations in SAS: 24991 - Compute biserial, point biserial, and rank biserial correlations.
Keep the predictor with the higher quality data. For example, if other cluster variables have lots of missing values, keep the variables that have more complete information.
Keep the more reliably measured variables. It’s easier to get an accurate measure of income or number of vacation days than job satisfaction. Assuming there is no theoretical reason to choose one over the other, the more reliable variable may be a better one to keep.
Keep the variable with the smallest 1 − R² ratio. A smaller number indicates a variable is a better cluster representative (more correlated within its cluster and less correlated across clusters), all else being equal. I think of this as a last resort and prefer any of the previous criteria, all of which require knowing your data.
Did this reduce collinearity? Yes. Using PROC REG to calculate variance inflation factors, I found the 37 predictors I retained all had VIF < 2.6.
Reducing redundancy using PROC VARREDUCE (SAS VIYA)
The second approach I’ll demonstrate uses the SAS Viya procedure VARREDUCE to reduce redundancy and therefore collinearity. This procedure can do two kinds of variable reduction: supervised and unsupervised. Supervised variable reduction involves retaining variables based on their association with the target. Unsupervised variable reduction retains variables based on the proportion of the variability in the original predictors that they can explain. I’ll use the unsupervised approach for reducing collinearity.
Let’s go through the code I used. Unlike the VARCLUS procedure, PROC VARREDUCE has a CLASS statement and can read in categorical variables. The categorical variables in the develop_training data were left out to use the same predictors as in the first demonstration. The REDUCE statement uses the UNSUPERVISED keyword to do unsupervised variable selection. The VARIANCEEXPLAINED option instructs the procedure to retain enough variables that at least 90% of the variability in the original predictors can be explained. The MINVARIANCEINCREMENT option only retains variables that increase the proportion of the original variance explained by at least 1%. This option can take values between 0 and 1, with a default value of 0.001 (that is, 0.1% of the predictor variability).
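The original code listing isn't reproduced here, so below is a minimal sketch consistent with the options just described; the CAS table name and the &inputs macro variable are assumptions.
/* Unsupervised variable reduction: keep adding predictors until at least 90% */
/* of the predictor variance is explained, accepting only predictors that */
/* increase the explained variance by at least 1% */
proc varreduce data=mycas.develop_training;
reduce unsupervised &inputs / varianceexplained=0.9 minvarianceincrement=0.01;
run;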
Below is some of the PROC VARREDUCE output:
In the table above, the predictor IM_CCBAL is retained first because it explains the greatest proportion of the variability among the predictors. The second variable added to the keep list is Dep, because it produces the greatest increase in the proportion of variance explained. This process reduces collinearity because once a predictor enters the keep list, any predictors highly collinear with it will not appreciably increase the proportion of variance explained and so are left off the list.
Did this reduce collinearity? Yes. The reduced set of 36 predictors shown above has VIF < 2.3 for all predictors.
Conclusion
I hope these demonstrations help you deal with collinearity problems, or simply with having too many predictors in your data. Variable reduction is only one approach to reducing the effects of collinearity. If all predictors are theoretically important and none can be dropped, other approaches are required. In those situations, biased regression techniques such as ridge regression may be a better choice. I will describe and demonstrate ridge regression in my next post.
Links
For an explanation of collinearity and its consequences please see my previous post: What is collinearity and why does it matter?
If you’re interested in learning about principal component analysis, please see How many principal components should I keep? Part 1: common approaches .
For an excellent class on developing predictive models try out the SAS course Predictive Modeling Using Logistic Regression.
Find more articles from SAS Global Enablement and Learning here.
01-13-2025
05:04 PM
1 Like
Collinearity, also called multicollinearity, refers to strong linear associations among sets of predictors. In regression models, these associations can inflate standard errors, make parameter estimates unstable, and reduce model interpretability. In this post, I’ll explain what collinearity is, when it matters, how to detect it, and strategies to mitigate its undesirable effects. In a follow-up post, I’ll show examples of variable reduction approaches that reduce collinearity prior to modeling.
What is collinearity?
Let’s go through the parts of the above definition for collinearity. Linear association means that as one variable increases, the other changes as well at a relatively constant rate. When collinear variables are graphed together, they tend to fall on a straight line. The term predictors in the definition tells us that collinearity is an unsupervised concept. It does not involve a response variable, only the predictors. And sets of predictors tells us that collinearity is not limited to associations between only two variables.
Collinearity can cause a variety of problems as described later in this post. It is worth noting that collinearity is not a violation of the assumptions of regression models (i.e., ε ~ i.i.d. N(0, σ²)). Regardless, collinearity should be assessed along with model assumptions as its presence can also cause modeling problems.
What causes collinearity?
Essentially all real-world data have some degree of collinearity among predictors. These associations exist for a variety of reasons. A common cause of collinearity is having predictors that measure the same underlying thing (sometimes called a latent variable). For example, runners’ best 5K race times, resting heart rates, VO₂ max, and so on are likely all correlated because they measure the same underlying (and difficult to measure directly) variable, cardiovascular fitness. Sometimes the predictors measure percentages of a whole, such as percent of income spent on housing and percent of income spent on non-housing expenses. Predictors like these that add up to 100% will necessarily be correlated, since increasing one requires decreasing the others. A third reason for collinearity is dumb luck. It may be that the sample you are analyzing has restricted ranges for several variables that result in linear associations. These associations might not exist in other samples from the population. This has implications for predictive modelers, who might face different patterns of collinearity in the data they intend to score than in their training and validation data. Predictive models lose performance when the patterns of collinearity change between model development and scoring.
What are the effects of collinearity on explanatory regression models?
Collinearity inflates the variance of parameter estimates, making it less likely to find significant effects, even for important predictors. It also makes it harder to get good parameter estimates due to instability, which is reflected in the increased width of the confidence intervals. Instability here means that small changes in the development data can result in large changes in the magnitude of the estimates and even changes in their sign. If the sign of a parameter is opposite from what is expected based on theory, explanation and interpretation become difficult.
Partial regression coefficients are attempts at estimating the effect of one predictor while holding all other predictors constant. But with strong collinearity, the data doesn’t show the effects of one predictor across the full range of the other predictors. So, the coefficients are trying to estimate something beyond what the data shows. Because of this, there are multiple possible ways to assign variance in Y to the predictors that fit the data almost equally well. This is why the standard errors of the parameter estimates are so large.
What are the effects of collinearity on predictive regression models?
Collinearity is generally more of a problem for explanatory modeling than predictive modeling. It will not reduce the predictive power of the model overall, but it will affect estimates of the individual parameters. Sometimes a highly significant model will have all non-significant individual predictors due to collinearity. These models can still make reliable predictions. Collinearity also inflates the variances of the predicted values. This can be especially severe if the values of X’s used to make predictions are outside the range of the development data.
Additionally, collinearity causes automated variable selection methods to perform poorly. Stepwise methods (forward, backward, and stepwise selection) can result in an essentially random choice of which of the collinear variables enters the model. The specific variables that enter and their order of entry can alter the trajectory of the variable selection process. This may prevent the process from finding the set of predictors that produces the lowest validation error.
The problems of unstable magnitude or sign can sometimes be important for predictive modelers too. If the model needs to be explained or interpreted, having theoretically implausible parameters will reduce the explainability of the model.
Detecting collinearity
As mentioned previously, a significant overall model with all non-significant predictors is usually a sign of collinearity. But typically, it is not obvious if collinearity exists in a particular data set. Three common approaches for detecting collinearity are calculation of Pearson correlation coefficients, variance inflation factors, and condition index values.
Correlation coefficients are a good first approach but should not be the sole method for assessing collinearity. This is because correlations cannot identify collinearity that exists among 3 or more predictors. For example, imagine a data set with test scores for a high school class: Exam 1 (E1), Exam 2 (E2), and Final Grade (F), with F calculated as the mean of E1 and E2. There might be only weak associations between any pair of E1, E2, and F, but the data have perfect collinearity. Knowing any two of these three variables gives you perfect information about the third.
A correlation coefficient of r = 0.8–0.9 or higher can indicate a collinearity problem. The SAS 9 procedure PROC CORR and the SAS Viya procedure PROC CORRELATION can calculate Pearson correlation coefficients. The Pearson correlation, r, is calculated as follows:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]
Variance inflation factors (VIF) can be calculated by regressing each predictor on the set of other predictors to be used in a model and plugging the resulting coefficient of determination (R²) into the following equation:
VIF = 1 / (1 − R²)
Each predictor will have its own VIF. So if regressing a predictor on the remaining predictors produces R² = 0.9, this corresponds to a variance inflation factor of 10. VIF = 10 is considered an indication of a serious collinearity problem that should be addressed; some researchers consider VIF > 8 to be problematic. Variables with high VIF can be considered to be causing the collinearity problem. As the name suggests, VIF describes the factor by which the parameter’s variance is inflated compared to a model with no collinearity.
VIF can be computed using the VIF option in the MODEL statement of either PROC REGSELECT (SAS Viya) or PROC REG (SAS 9). An example of calculating VIF with PROC REG for the variables in SASHELP.IRIS shows that petal length and petal width are collinear:
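Here is a minimal sketch of one way to produce these VIFs (the exact model used in the post may differ). Because the VIF for each predictor depends only on the other predictors, a placeholder response is enough for screening:
/* Hedged sketch: VIF for the four iris measurements using a placeholder response */
data iris_vif;
   call streaminit(1);
   set sashelp.iris;
   y = rand("normal");   /* placeholder response; the diagnostics use only the X's */
run;

proc reg data=iris_vif;
   model y = SepalLength SepalWidth PetalLength PetalWidth / vif;
run;
quit;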
This is not surprising as these variables have a strong correlation of r=0.96. If petal length is dropped, all remaining VIF scores become < 4. While these data contain two collinear variables which could have been detected with correlations, VIF has the advantage of looking for relationships among three or more predictors.
Condition index values involve principal component analysis (PCA). For a brief description of PCA, including a discussion of determining the number of components to use, please see my previous post: How many principal components should I keep? Part 1: common approaches. Calculating condition index values requires computing the eigenvectors (i.e., principal components, PCs) and eigenvalues (variances of PCs) of the sums of squares and cross products matrix X'X, with X being the design matrix of predictors. The condition index values are the square roots of the ratio of the first eigenvalue to each eigenvalue. Since the last eigenvalue will always be the smallest, it will have the largest condition index value.
Here is an illustration to help think about condition index values. On the left is pictured correlated data showing heights and weights for a group of people, while on the right is relatively uncorrelated data showing heights and incomes of the same people. For each data set, PCA finds PC1, a new variable that is a linear transformation of the two variables and has the greatest variance. A second new variable, PC2, is created so that it explains the second greatest proportion of variation in the original variables and is perpendicular to PC1. The eigenvalues (variances) of PC1 and PC2 are λ₁ and λ₂, respectively.
In the correlated data, λ₁ is much bigger than λ₂, so the square root of the ratio λ₁/λ₂ (the condition index) will be a large number. This large number is basically indicating that the data are spread along a line, with a long direction and a narrow direction in space.
Compare this with the uncorrelated data. The data look more like a ball than a line. Here λ₁ and λ₂ have similar magnitudes, so the condition index will be close to 1. A small condition index indicates no collinearity, with the data points spreading out randomly in space. Again, although demonstrated with two variables, this approach has the advantage of being applicable to three or more predictors.
Condition index values can be calculated by using the COLLIN and COLLINOINT options in the MODEL statement of PROC REG. COLLIN calculates condition index values for each column of the design matrix X, including the intercept column. COLLINOINT removes the intercept from the calculations and reports one fewer condition index. Some authors suggest using COLLIN only if the intercept has physical reality, in the sense that it is a possible value within the range of the data; otherwise COLLINOINT is recommended.
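Continuing the hedged iris sketch from above (same placeholder response, since these diagnostics use only the predictor matrix), the options simply go on the MODEL statement:
/* Hedged sketch: condition index diagnostics with and without the intercept.
   iris_vif is the placeholder data set created in the earlier sketch. */
proc reg data=iris_vif;
   model y = SepalLength SepalWidth PetalLength PetalWidth / collin collinoint;
run;
quit;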
Some authors recommend treating condition index (CI) values >100 as indicating a severe collinearity problem; others recommend using CI>30. Bear in mind that these cutoffs are as arbitrary as using p<0.05 to assess significance. Once principal components with high condition index values have been found, variables that have more than half their variability associated with those components can be considered to be causing the collinearity problem.
Below is an example of the COLLINOINT table from PROC REG for the variables in SASHELP.IRIS. It shows the 4th PC (last row) has a condition index of 11, and 99% and 82% of the variability of petal length and petal width, respectively, are associated with this fourth principal component. A CI of 11 is well under the commonly used threshold for concern of 100. Condition index values excluding the intercept tend to be lower than when the intercept is included; the condition index for the 5th PC when the intercept was included was 51.
How can we check for collinearity involving categorical predictors?
The SAS Viya procedure REGSELECT can model categorical predictors as well as continuous predictors. The VIF option in the MODEL statement will calculate variance inflation factors for all the parameters including the dummy variables that are created for each categorical predictor.
The SAS 9 procedure PROC GLMSELECT can export the design matrix X, which includes dummy variables for any categorical predictors. These data can be read into PROC REG which can be used for calculating VIF and condition index values for the categorical and continuous predictors.
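Here is a hedged sketch of that workflow with placeholder data set and variable names; PROC GLMSELECT creates the macro variable &_GLSMOD, which holds the names of the design-matrix columns:
/* Hedged sketch: export the design matrix (with dummy variables) from
   PROC GLMSELECT, then compute VIF and condition indices in PROC REG. */
proc glmselect data=mydata outdesign(addinputvars)=design_mat noprint;
   class cat1 cat2;                          /* placeholder categorical predictors */
   model y = cat1 cat2 x1 x2 / selection=none;
run;

proc reg data=design_mat;
   model y = &_glsmod / vif collin;          /* &_GLSMOD = design-matrix column names */
run;
quit;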
Dealing with collinearity
Many of the methods for dealing with collinearity are attempts to reduce the variances of the parameter estimates. Most involve either variable selection (removing redundant variables) or modified parameter estimates (e.g., biased regression methods). Both types of approaches can be applied at the same time. Here are six approaches to consider for your data.
Increase sample size
Often data collection is finished before analysis begins, so this may not be an option for many people. But if possible, collecting more data will likely reduce the standard errors of the parameters, which is the main problem caused by collinearity. Further, if the collinearity is due to the dumb luck of a sample having limited ranges of predictors, the collinearity may disappear in a larger sample. Increasing the sample size will usually decrease standard errors and make it less likely that results are some sort of sampling fluke.
Remove predictors
Variables with high VIFs (or a set of dummy variables for a categorical predictor) can be dropped one at a time until VIF or condition indices go below your chosen threshold. These can be considered redundant variables. Keep in mind that for explanatory modeling, dropping variables is equivalent to assuming the coefficients are zero for these predictors. Variable reduction can be done prior to model fitting using the SAS Viya procedure PROC VARREDUCE. I will describe this and other variable reduction approaches in my next post.
Recoding variables
Sometimes collinear variables can be combined into a single new predictor that can be estimated with lower variance. A related approach is to run a factor analysis on the collinear predictors and use the resulting factor scores as predictors. If combining variables does not make sense, another strategy is to recode the predictors to reduce their correlation. For example, imagine data in which household garbage production and power consumption are highly correlated. The researcher believes the correlation arises because households with more people have higher values of both. They can try redefining these variables as garbage production per person and power usage per person in the household to break the correlation. Hopefully the modified variables are similar enough to the originals to answer the same research questions.
Biased regression techniques
What if all the predictors are theoretically important and none can reasonably be dropped solely to reduce standard errors? In this case, biased regression techniques such as PCA regression or ridge regression can be used, as they retain all the predictors. PCA produces new variables, equal in number to the original variables, that are all perfectly uncorrelated, meaning there is no collinearity. These uncorrelated derived variables are the predictors in PCA regression. Ridge regression shrinks regression coefficients toward zero, which decreases the variability of the model’s predictions. Shrinkage adds a small amount of bias in exchange for a larger reduction in variance, per the bias-variance trade-off. For a review of the bias-variance trade-off, please see my previous post Big Ideas in Machine Learning Modeling: The Bias-Variance Trade-Off.
Centering
This could be included under “recoding variables”, but it is a simple fix for a common cause of collinearity, so I list it separately. When fitting a model with polynomial terms (e.g., Y = X + X² + X³) or interaction effects (Y = X1 + X2 + X1*X2), you can expect collinearity between a lower-order term and the higher-order term that contains it. This kind of collinearity can be reduced by mean-centering the continuous predictor prior to calculating the polynomial or interaction. SAS statistical procedures with an EFFECT statement, such as the LOGISTIC, GLIMMIX, and GLMSELECT procedures, can center polynomials using the EFFECT statement option STANDARDIZE(METHOD=MOMENTS)=CENTER. Centering variables for use in interactions can be done using the PROC STDIZE option METHOD=MEAN.
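Here is a minimal sketch of both options, with placeholder data set and variable names (GLMSELECT is shown for the polynomial effect, but any procedure with an EFFECT statement works the same way):
/* Centered polynomial effect via the EFFECT statement */
proc glmselect data=mydata;
   effect x_poly = polynomial(x / degree=3 standardize(method=moments)=center);
   model y = x_poly / selection=none;
run;

/* Mean-center predictors before forming an interaction */
proc stdize data=mydata out=centered method=mean;
   var x1 x2;
run;

data centered;
   set centered;
   x1x2 = x1*x2;   /* interaction built from the centered variables */
run;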
Do nothing
As mentioned previously, multicollinearity is not a violation of the assumptions of regression. Sometimes the best thing to do is just to be aware of the presence of multicollinearity and understand its consequences. High VIFs can often be ignored when standard errors are small relative to parameter estimates and predictors are significant despite the increased variance. And again, if the goal is solely prediction, the effects of collinearity are less problematic.
In my next post, I will show how to remove collinearity prior to modeling using PROC VARREDUCE and PROC VARCLUS for variable reduction.
Find more articles from SAS Global Enablement and Learning here.
10-24-2024
02:39 PM
1 Like
As you probably know, there is abundant need for big data analysis and machine learning modeling these days. One concept that is important in learning big data analysis is the bias-variance trade-off. In this blog, I’ll describe and illustrate the bias-variance trade-off and explain how understanding this concept can simplify your learning experience by giving you a framework to view hyperparameter tuning in your machine learning models.
Most users of machine learning models are focused on predictive modeling. Predictive models, also called supervised learning models, involve a target variable (the focus of your business or research) and inputs (the predictors that hopefully inform you about the target). Supervised machine learning models identify input-target associations in order to predict future values of the target.
The focus of predictive modeling is to make the most accurate predictions possible. This is a different goal than statistical inference, which usually involves explaining the input-target relationships. For predictive modelers, explaining the relationships among variables is of secondary importance at best. To assess the accuracy of the predictions, two common measures are the mean square error (MSE) and average square error (ASE). Typically, we use MSE for measuring error in the training data and ASE when applying a model to a data set such as validation data that was not involved in creating the model. Whichever metric we use, we can view this as the prediction error that we want to minimize.
So, all else being equal, a predictive model with lowest error can be considered a better model. How do we reduce the prediction error in our machine learning models? By optimizing the bias-variance trade-off. Both bias and variance contribute to the prediction error.
What is bias? In the context of predictions, a model makes biased predictions when the predicted values systematically underestimate or overestimate the actual values. It is the difference between the expected value of an estimate and its true value. Bias cannot be measured in real data because it requires knowing the true values which we are trying to learn in the first place. Models that are too simple to capture the true patterns in the data have high bias. Typically, we can reduce bias by increasing the complexity of the model (more parameters, higher order terms, etc.).
What is variance? In the context of predictions, a model has high variance when the predicted values are highly variable when the model is applied to different samples from the same population. You can also think of variance as the sensitivity of the predictions to the particular data set used to train the model. Variance cannot be measured from fitting the model to a single data set as is typically done, although it could be measured by fitting it to several bootstrap samples of the same data. Note that high variance here does not imply anything about making correct predictions. It is about the variability in the predictions, not how accurate they are. Models that are overly complex for the data have high variance. “Overly complex” here means that the model describes aspects of the observed data that are not representative of the population and are unique to the observed data. That is, overly complex models model the “noise” in addition to the “signal”. A model with higher complexity is more flexible and can fit the data with lower bias. But because of the increased flexibility, the model’s predictions can differ dramatically between data sets.
So high bias is associated with models that are too simple to describe the true patterns in the population and high variance is associated with models that are unnecessarily complex. You can imagine that as you increase model complexity from too simple to too complex, the model bias decreases and variance increases. There is a trade-off, where lowering one increases the other.
Prediction error increases with both the bias and variance:
MSE = bias² + variance + irreducible error
where the irreducible error is the component of the MSE unrelated to model complexity and won’t be considered here. Finding the model that has the lowest error involves finding the right amount of complexity, that is, the amount of bias and variance that minimizes the error. This amount is not something you will know before some trial and error. The trial and error should be guided by measuring model performance on a holdout validation data set. It turns out that adding a small amount of bias often results in a large reduction in variance and an overall decrease in prediction error.
Here are illustrations of models with high bias or high variance. We have a population with the true relationship Y = 7 + 0.3X + 40X² (plus noise). I’ll draw 3 samples from this population and fit models that are too simple (high bias) and models that are too complex (high variance). Here’s how I generated the data for the first sample:
data sample;
   call streaminit(123);
   sample="sample 1";
   do i=1 to 50;
      X=i;
      y=7 + 0.3*X + 40*X*X + 5000*RAND("normal");
      true_y=7 + 0.3*X + 40*X*X;
      output;
   end;
run;
The CALL STREAMINIT routine specifies a seed that starts the random number generator used by the RAND function. RAND(“normal”) generates a random value from a normal distribution with a mean of 0 and SD=1. Here’s a plot of the true relationship:
When we approximate this true curvilinear relationship with a linear regression model, we are using a model that is too simple and the predictions are biased. At X-values near zero and near fifty, the predicted values (shown by the red line) are too low—they underestimate the true values. At X-values near 25, the linear regression overestimates Y.
What happens when we use a model that is too complex? I’ll approximate this curvilinear relationship with a spline function. Splines are flexible, piecewise polynomial functions joined together smoothly at join points called knots. I’ll use a spline model based on 10th-degree polynomials. The predictions are no longer biased too low at the extremes and too high near the center. But this spline model is overly complex, and it captures the noise in the data. Each little “bump” in the spline is a pattern found in this specific sample and not likely to be found in the next sample drawn from the same population. The spline model predictions have low bias, but they have high variance.
Let’s see a few more high bias and high variance models fit to other samples from the same population. To generate additional samples, I changed the seed of the CALL STREAMINIT statement in the SAS code above and concatenated the 3 samples into a data set called “combined”. I used the following PROC SGPANEL code to visualize models fit to these samples:
title "high bias";
proc sgpanel data=bias_var.combined;
panelby sample / novarname columns=3;
scatter x=X y=y /;
reg x=X y=y / nomarkers lineattrs=(color=red);
run;
title "high variance";
proc sgpanel data=bias_var.combined;
panelby sample / novarname columns=3;
scatter x=X y=y /;
pbspline x=X y=y / nomarkers degree=10 lineattrs=(color=green);
run;
We can see the linear regression is biased. In all three samples, the average predictions are different than the true values. So, this error is not a fluke. It is systematic due to using a model that is too simple for these data. Using a model that is too simple for the data is referred to as underfitting.
The 3 spline models demonstrate high variance. At first, the curves might seem similar but look at each peak and valley across samples. They don’t match up. That is, the predictions vary considerably across different samples. Models like these can be described as overfitting the data. These models are so specific to the data they were trained on that they will perform poorly at predicting new data. On average, the bias is low, but in practice, we will only have one sample for modeling. We may end up getting the sample and the model with predictions that are far from the average—there is no way of telling. Variance here is the sensitivity of the predicted values to the particular data set used to train the model.
So why does this matter? From a practical point of view, can’t we just focus on picking a model with lowest prediction error on a validation data set and ignore the bias-variance trade-off concept? Well, yes. But understanding bias and variance gives a conceptual framework that will help with using machine learning models. One situation this comes up in is tuning hyperparameters. Hyperparameters are the settings for a machine learning model that are established before the model is fit to the data. They set the guidelines for how the machine learning model goes about producing the predictions.
Here are some examples of hyperparameters that affect the bias and variance of the resultant model.
Many of the options in supervised machine learning SAS Viya procedures such as PROC BN, PROC FOREST, PROC GRADBOOST, PROC NNET, and PROC SVMACHINE are used to adjust the complexity of the model to optimize the amounts of bias and variance. The same hyperparameters can appear in several models (e.g., L1 regularization, also called LASSO). Adjusting these will help find a model that avoids overfitting or underfitting. For example, a larger L1/LASSO penalty makes a model less sensitive to the training data, which reduces the amount of noise being modeled and lowers the chance of overfitting. Understanding bias and variance allows you to view a dozen different hyperparameters as methods to make models err on the side of too simple or too complex.
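As one illustration, here is a hedged sketch of a PROC GRADBOOST call with a few of its complexity-related options; the CAS table and variable names are placeholders, and the option values are arbitrary starting points rather than recommendations:
/* Hedged sketch: GRADBOOST hyperparameters that push the fit toward
   simpler (higher bias) or more complex (higher variance) models. */
proc gradboost data=mycas.train
      ntrees=100          /* more trees: more complex fit */
      maxdepth=4          /* deeper trees: more complex fit */
      learningrate=0.1    /* smaller values: slower, more conservative fitting */
      lasso=0.5           /* L1 penalty: larger values push toward simpler fits */
      ridge=1;            /* L2 penalty: larger values push toward simpler fits */
   target b_tgt / level=nominal;       /* placeholder target */
   input x1-x20 / level=interval;      /* placeholder inputs */
run;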
To learn more about the important concepts in machine learning modeling, try the SAS class “Statistics You Need to Know for Machine Learning”. This course covers the bias-variance trade off as well as many other important concepts that will help you on your journey as a machine learning modeler. It also is the best preparation for students interested in earning the SAS Credential “SAS Certified Associate: Applied Statistics for Machine Learning”.
See you at the next SAS class!
Links:
Course: Statistics You Need to Know for Machine Learning (sas.com)
SAS Certified Associate: Applied Statistics for Machine Learning
Find more articles from SAS Global Enablement and Learning here.
10-24-2024
02:19 PM
Principal component analysis (PCA) is an analytical technique for summarizing the information in many quantitative variables. Often it is used for dimension reduction, by discarding all but the first few principal component axes. These first few usually explain most of the variability in the original variables. How can we determine the number of principal component axes to keep for further analysis? In a previous post (How many principal components should I keep? Part 1: common approaches), I described some of the common approaches to determining the number of PC axes to retain. These approaches are all subjective and do not distinguish components that summarize meaningful correlations among variables from components that capture only sampling error. In this post, I will demonstrate how to construct a significance test for the number of PC axes to retain for analysis that avoids these limitations.
How many components to retain from PCA?
In a previous post (How many principal components should I keep? Part 1: common approaches), I described some of the common approaches to determining the number of PC axes to retain. These included keeping components that account for a set percentage of variability, retaining components with eigenvalues greater than 1, and components up to the "elbow" of a Scree plot. These approaches are easy to implement and may be adequate for many research goals. But each of these approaches is somewhat subjective, and more importantly, they are unable to distinguish principal components that represent correlations among variables from sampling error. For example, if we create data that is made of pure noise with no true correlation among variables, PCA will still produce eigenvalues and eigenvectors.
To illustrate this, I used PCA on randomly generated data with no structure. The code below creates a data set randnumbers (n=100) with four variables X1 through X4, each taking on integer values between 1 and 100. These variables are randomly generated with no underlying structure or associations among them. Any correlation among them is purely due to sampling error (random chance). I used PROC PRINCOMP on these data to carry out PCA on the randnumbers data.
data randnumbers;
   call streaminit(99);
   do i=1 to 100;
      x1=rand("integer", 1, 100);
      x2=rand("integer", 1, 100);
      x3=rand("integer", 1, 100);
      x4=rand("integer", 1, 100);
      output;
   end;
run;

proc princomp data=randnumbers;
   var x1 x2 x3 x4;
run;
The PCA on the random data produced the following eigenvalues and Scree plot:
Saving PCs that explain at least 75% of the variation would result in retaining PC1 through PC3, which collectively explain almost 80% of the variability in the original predictors. Retaining components with eigenvalues >1 results in treating only PC1 as meaningful. Using the Scree plot bend method suggests retaining the first 2 PCs. But we know that none of these components represents true structure in the underlying data.
Randomization tests to determine the significant number of principal components
An alternative approach is to use a randomization test to determine the number of principal components to keep. The idea here is that we want to distinguish meaningful PCs from noise. That is, we want to know if the PCs explain more variability than similar data with no real structure that summarize no real correlations. To do this, the values of each variable can be randomly shuffled, and the PCA can be recalculated for the new structureless data set. All the original data is retained in these permuted data sets, but the values of each of the variables are randomized relative to one another. This randomization followed by PCA is repeated many times (say 100 to 1000 times) and the eigenvalues from each randomly permuted data set are retained.
This produces a null distribution of eigenvalues that can be used to test the hypothesis that the original data have no structure, that is, that the correlations represented by the components are entirely due to sampling error. To carry out the test using alpha=0.05, the first eigenvalue of the real-data PCA is compared to the 95th percentile of the null distribution of the first eigenvalue. If it is smaller than the 95th percentile of the null distribution of PC1 eigenvalues, then the first PC represents sampling error rather than true structure in the original data, and none of the PC axes should be retained. If the first PC eigenvalue is greater than 95% of the null distribution, it represents significantly more correlation among variables than you would expect from random chance. If the first PC is significant, then proceed to testing PC2 the same way (and so on). The test could be constructed with other alphas, such as alpha=0.01, by comparing to the 99th percentile of the null distribution.
Here is how I conducted a randomization test with the PCA on Fisher’s iris data. These data include 4 variables measuring floral morphology for 3 species of iris. The variables are sepal length, sepal width, petal length, and petal width. Fifty flowers of each of 3 Iris species were measured for a total of n=150. These data are included with SAS in sashelp.iris.
First, I added a variable called sort to the data which has the consecutive values 1 through 150, then used PROC PRINCOMP to carry out PCA. The results were shown in my previous post (How many principal components should I keep? Part 1: common approaches).
In the macro program irisRand, I used PROC PLAN to randomly shuffle the integer values of sort while keeping the original order of the rest of the variables. The output data were sorted by sort, and only sort and one floral variable were retained. This was repeated for each of the 4 floral variables, and the randomly shuffled values were merged by the macro program null_dist_PCA. PROC PRINCOMP was run on these randomized data and the eigenvalues were saved, transposed, and appended to a data set null_EVs. This was repeated 1000 times, and PROC UNIVARIATE was used to find the 99th and 95th percentiles of the null distribution of each of the 4 principal components.
Here is the SAS code:
/* create variable sort for randomization */
data iris;
set sashelp.iris;
sort=_n_;
run;
/*find eigenvalues for original iris data */
proc princomp data=iris;
var SepalLength SepalWidth PetalLength PetalWidth;
run;
/*randomize the order of sort variable */
%macro irisRand (trait=, traitno=);
proc plan;
factors sort=150/noprint;
output data=iris out=randomized;
run;
/* sort by sort number to randomize the rows. Drop everything but one trait*/
proc sort data=randomized out=random&traitno (keep=&trait sort);
by sort;
run;
%mend irisRand;
/* randomize values of 4 traits and merge the data sets,
carry out PCA and save eigenvalues, transpose and append
eigenvalues, repeat &iter=1000 times */
%macro null_dist_PCA (iter=);
%Do j=1 %to &iter;
%irisRand (trait=sepallength, traitno=1);
%irisRand (trait=sepalwidth, traitno=2);
%irisRand (trait=petallength, traitno=3);
%irisRand (trait=petalwidth, traitno=4);
data randomized_iris;
merge random1-random4;
drop sort;
run;
ods select none;
proc princomp data=randomized_iris;
var SepalLength SepalWidth PetalLength PetalWidth;
ods output Eigenvalues=ev(keep=eigenvalue);
run;
ods select all;
proc transpose data=ev out=ev_transposed (drop=_name_) prefix=eigen;
run;
proc append data=ev_transposed base=null_EVs;
run;
%end;
%mend null_dist_PCA;
/*prevents overflow of log*/
proc printto log='randomization_log.txt' new;
run;
%null_dist_PCA (iter=1000);
/*resets log to original destination*/
proc printto log=log;
run;
proc univariate data=null_EVs noprint;
var eigen1 - eigen4;
output out=crit pctlgroup=byvar pctlpts=99 95
pctlpre=eigenvalue1_
eigenvalue2_ eigenvalue3_ eigenvalue4_ pctlname=P99 P95;
run;
proc print data=crit;
run;
Here are the eigenvalues from the original iris data:
And here are the 99th and 95th percentiles of the null distribution for each of the four eigenvalues:
So, using the 99th percentile (or the 95th) as the cutoff for statistical significance, only the first eigenvalue is significant. That is, we cannot reject the null hypothesis that the correlations summarized in PC2 through PC4 are due to random noise. This is a different result than would be obtained by retaining PCs that explain at least 75% of the variability in the original predictors. And importantly, it is less subjective than several other approaches (even though the choice of alpha is itself subjective). Of course, the decision on which method to use will depend on the specifics of your research question. Regardless of your favorite method, it can be useful to have this tool in your statistical toolkit.
For more information on PCA and other multivariate techniques, try the SAS course Multivariate Statistics for Understanding Complex Data. You can access this course as part of a SAS learning subscription (linked below).
See you at the next SAS class!
Links:
Course: Multivariate Statistics for Understanding Complex Data (sas.com)
SAS Learning Subscription | SAS
Find more articles from SAS Global Enablement and Learning here.
10-24-2024
02:09 PM
2 Likes
Principal component analysis (PCA) is an analytical technique for summarizing the information in many quantitative variables. Often it is used for dimension reduction, by discarding all but the first few principal component axes. The first few components usually explain most of the variability in the original variables. How can we determine the number of principal component axes to keep? There are several commonly used approaches which I will describe in this post. The number of axes used often depends on a percentage of variability explained, component axes with eigenvalues greater than 1, or visual inspection of a Scree plot. In a follow up post, I will describe how to construct a significance test for the number of principal component axes to retain for further analysis.
What is PCA?
PCA is a widely used summarization and dimension reduction technique for multivariate data. It starts with a set of original variables to be summarized. New derived variables are constructed that are weighted linear combinations of these original variables. These derived variables are called principal components (PCs), and the number of PCs created is the same as the number of original variables.
For example, let’s look at a scatter plot of the heights and weights of 19 students from the sashelp.class data set:
Height and weight are positively correlated. A principal component analysis of these data would result in 2 principal component axes, the first going through the widest spread of the scatterplot and the second axis perpendicular to the first:
These new variables, the PC axes, are functions of the original variables, in this case PC1 = 0.71*height + 0.71*weight and PC2 = 0.71*height − 0.71*weight. When the PCs are functions of only a few variables, as in this case, they can be interpreted. The first PC, which is based on positive weights or loadings of the height and weight variables, could be interpreted as a measure of overall size or stature. In one sense, PC1 could be considered the “real” variable, and height and weight could be considered two ways of measuring overall size. The second PC could potentially describe how slim vs. stocky the students are for their given size, but we can’t directly see that from the data. When the PCs are functions of more than a few variables, they generally become uninterpretable.
With a third variable such as age included in the data, a third PC would be produced. This third PC axis would be perpendicular to the first two. When PCs are used for further analysis, usually only the first few PCs are retained, as they explain most of the variability in the original predictors. Running PCA on height, weight, and age shows that the first two components explain over 96% of the variation in the three original variables. This is the basis for using PCA for dimension reduction—there is minimal loss of information when PC3 is not used in future analyses.
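You can verify this quickly; the following runs a correlation-based PCA on the three variables:
/* PCA on height, weight, and age from sashelp.class (correlation-based by default) */
proc princomp data=sashelp.class;
   var Height Weight Age;
run;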
Dimension reduction through PCA can be an early step in developing machine learning models. For data scientists with many predictors, PCA can be used for replacing many of the original inputs with relatively few PCs. Additionally, using PCs in place of the original inputs has the added benefit of removing any potential collinearity problems. Collinearity describes strong correlations among sets of predictors, and it can lead to unstable parameter estimates with large standard errors. This is not a problem with PCs because they are constructed in a way that ensures all PCs are uncorrelated.
How principal components are constructed
The first principal component, PC1, is constructed such that it has the greatest variance of any possible linear transformation of the original variables. Further PCs are constructed such that PC2 accounts for the second greatest proportion of the variability among the original variables, PC3 accounts for the third greatest proportion, and so on. The variances of the PCs are called eigenvalues. The PCs have the property that the correlation between any two principal components is exactly zero.
PCs can be constructed based on correlations among variables or based on covariances. Using correlations is the much more common approach, because PCA based on covariances is highly sensitive to the scaling of the variables: the original variable with the greatest variance will typically dominate the first PC axis to the exclusion of all other variables. To remove the effect of scale, correlations are preferred unless the original variables are already on a similar scale. Principal component analysis can be carried out programmatically using the SAS 9 procedure PROC PRINCOMP or the SAS Viya procedure PROC PCA.
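For reference, PROC PRINCOMP uses correlations by default; the COV option switches to a covariance-based analysis:
/* Correlation-based PCA (the default) */
proc princomp data=sashelp.iris;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Covariance-based PCA: sensitive to each variable's scale */
proc princomp data=sashelp.iris cov;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;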
Because the first few PCs tend to explain the majority of variation, often the remaining PCs can be discarded with minimal loss of information. The assumption is that early components summarize meaningful information, while variation due to noise, experimental error, or measurement inaccuracies is contained in later components. When PCs are discarded in this manner, the number of dimensions (variables) can be greatly reduced. And unlike the original variables, it is impossible to have collinearity problems with the PCs since they are perfectly uncorrelated. But how many of the PCs should be kept for further analysis? If too few components are retained, future analyses using the components will suffer from loss of the relevant information. If too many components are retained, the additional noise can distort the underlying pattern of correlations among variables that the components are supposed to summarize.
How many PCs to retain?
One method to decide how many PCs to retain for analysis is to pick a proportion of variance to explain that sounds reasonable. Would you be content to keep PCs until 75% of the variability in the original variables is accounted for? How about at least 90%? You may not even need to decide in advance; you can simply look at your PCA results and pick the number of PC axes that retain most of the original variability. While arbitrary, this may be reasonable for some research goals.
Another approach is to retain all PCs that have eigenvalues greater than 1. When PCA is based on correlations, the sum of the variances of the PCs (the eigenvalues) will equal the number of predictors being summarized. This means the average of the eigenvalues equals 1. So, using eigenvalue > 1 as a threshold for considering a PC to be meaningful amounts to keeping all PCs that account for more variability than the average.
A third approach involves examining a plot of the variance of each PC vs the PC number, called a Scree plot. “Scree” means an accumulation of rocks at the base of a mountain or cliff. If you picture the profile of a cliff, bending to meet the ground, this is the typical shape of a Scree plot and where it gets its name (thanks, Wikipedia). Typically, the curve connecting the components will have a bend, making an “elbow” shape. This point at which the curve flattens out indicates the maximum number of PCs to retain, and those below the elbow are discarded. What’s the justification for this? Recall that the eigenvalues decrease sequentially from the first component to the last. So, the Scree plot typically shows a steep decline that asymptotically approaches zero. Where the graph straightens out, the components are explaining a relatively small proportion of variance.
To illustrate these approaches, I used the SAS 9 procedure PROC PRINCOMP to analyze Fisher’s famous iris data set. These data contain 4 variables measuring floral morphology for 3 species of iris. The variables are sepal length, sepal width, petal length, and petal width. Fifty flowers of each of 3 iris species were measured for a total of n=150. These data are included with SAS in sashelp.iris.
proc princomp data=sashelp.iris plots=all;
var _numeric_;
run;
The eigenvalues table shows the variances (eigenvalues) of the four principal components, along with their differences, their proportion of the total variance, and the cumulative variance explained:
If we want to save components that represent at least 75% of the variability, the first two PCs would be retained. These account for over 95% of the variability in the original 4 variables. If instead only eigenvalues > 1 were retained, only PC1, with an eigenvalue of 2.9, would be kept; all remaining PCs have eigenvalues below 1.
The Scree plot shows a steep decline in eigenvalue from PC1 to PC2 and a flattening out of the curve between PC3 and PC4:
Depending on where one sees the elbow, this Scree plot suggests keeping 3, or possibly only 2, PCs for further analysis. Note that some authors suggest discarding the elbow point as well and retaining only the PCs before the bend.
There are several other graphs that can be produced to help interpret the principal components. To learn more generally about the graphical output of PROC PRINCOMP, see Rick Wicklin’s blog How to interpret graphs in a principal component analysis - The DO Loop (sas.com).
An advantage of these approaches is that they are all easy to implement. If the goal is dimension reduction, any of them may be reasonable. But when the focus of the PCA is summarizing the data with meaningful axes of variation, the choices presented here might not be satisfactory. They rely on subjective, somewhat arbitrary choices. An additional downside is that none of them consider that the correlation captured by the principal components could be entirely due to sampling error. In my next post, I’ll demonstrate an approach for constructing a test for significant principal components that can overcome these limitations.
For more information on PCA and other multivariate techniques, try the SAS course Multivariate Statistics for Understanding Complex Data. You can access this course as part of a SAS learning subscription (linked below).
See you at the next SAS class!
Links:
How to interpret graphs in a principal component analysis - The DO Loop (sas.com).
Course: Multivariate Statistics for Understanding Complex Data (sas.com)
SAS Learning Subscription | SAS
Find more articles from SAS Global Enablement and Learning here.
02-19-2024
02:11 PM
Last year, a student in the SAS Visual Statistics: Interactive Model Building class I was teaching asked about modeling a continuous, skewed response variable with a large number of zeros. He asked for some possible approaches using SAS Visual Statistics, and we ended up talking about two-stage modeling. In this post, I’ll walk through how we used SAS Visual Statistics to fit a two-stage model to the course data set (VS_Bank).
VS_bank data
VS_Bank is a simulated data set representing accounts at a large financial services firm. It has 1.1 M rows, with 12 interval-scaled predictors (Logi_RFM1–Logi_RFM12) and 2 categorical predictors (Cat_input1, Cat_input2). The data set has 3 target variables: a binary target indicating whether a purchase was made (B_tgt), an interval target giving the total value of the purchases (Int_tgt), and a count target giving the number of purchases (Cnt_tgt, not used in this post). The data set has roughly 20% purchasers (B_tgt=1) and 80% non-purchasers (B_tgt=0). The many zero values for total purchases were coded as missing for the interval target (Int_tgt has 80% missing values). All missing predictor values were imputed (Int_tgt was not imputed), and interval predictors were log transformed to reduce skewness prior to modeling (hence the “Logi_” prefix).
Two-stage modeling
The goal we chose for modeling with the VS_Bank data was to predict the most valuable potential customers for the bank’s marketing department to target. The two-stage modeling approach involves separately modeling whether a customer is a likely purchaser (the stage-1 model, using the binary target B_tgt) and then using the predicted probability of making a purchase as an input for modeling the total value of purchases (the stage-2 model, using the interval target Int_tgt). Incorporating the predictions from the stage-1 model into the stage-2 model is likely to improve its predictive power, particularly since the $0 total-value customers (i.e., the non-purchasers) were coded as missing values for Int_tgt. Note that the same or different predictors can be used for the two models.
The predicted values from both models can be multiplied. Then customers can be ranked from highest to lowest value for marketing. Why multiply the probability of purchasing by the predicted total value of purchases? The product can be thought of as the expected future profit, and multiplying can give an improved assessment of customer profitability, compared to looking at the predictions for total value of purchases (the stage-2 model target) alone. This is because the customers who are predicted to make the largest dollar amount of purchases might not be very likely to make any purchase at all. If the predicted probability of B_tgt=1 and Int_tgt are negatively correlated, using only the predicted value of Int_tgt would paint a misleading picture of the customers most valuable to target for marketing. This is essentially what we found after producing a stage-1 model and plotting the relationship between the stage-1 model predictions and the stage-2 target:
Stratified partitioning for honest assessment
We decided to partition the data into training and validation sets and assess each potential stage-1 and stage-2 model on its validation performance. The simplest approach to partitioning is to use a random sample of, say, 50% for training and 50% for validation. Instead, I recommend using a stratified random sample for each partition. For the stage-1 model (predicting a binary target), the partitions should be stratified on the target. Stratification means that when the training and validation data sets are created, the same proportion of purchasers appears in each partition. This makes performance on the partitions more comparable than if there were, say, 22% purchasers in the training data and 18% purchasers in the validation data. A stratified partition can be created using the New Data Item option in Visual Analytics, then choosing Partition:
For the interval target models, care should be taken when creating the partitions. Monetary values are often highly skewed, and this is the case with the interval target, Int_tgt. If, by chance, all the rare high-value purchasers ended up in the validation data with none in training, model performance would suffer. To avoid this, a partition can be created that is stratified on a binned version of the interval target. We did this using the New Data Item option to create a measure variable, binned_purchases: missing values were put in bin 0, total purchases <= $50K went into bin 1, total purchases <= $100K went into bin 2, and purchases greater than $100K were put into bin 3 (the maximum was $500K):
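For readers who prefer code to the point-and-click steps, here is a hedged data step sketch of that binning logic; the input table name is a placeholder:
/* Hedged sketch of the binning described above (Int_tgt in dollars) */
data vs_bank_binned;
   set vs_bank;                                  /* placeholder table name */
   if missing(Int_tgt)       then binned_purchases = 0;  /* non-purchasers */
   else if Int_tgt <=  50000 then binned_purchases = 1;
   else if Int_tgt <= 100000 then binned_purchases = 2;
   else                           binned_purchases = 3;  /* up to $500K */
run;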
We then used binned_purchases to create a categorical variable (bins) by choosing New Data Item, then Custom Category. This categorical variable was used for stratification of the second partition we created. SAS Visual Statistics will hold only one partition variable at a time, so the stage-1 model partition was hidden when the stage-2 partition was created.
Stage-1 models
We tried three stage-1 models for predicting the binary target B_tgt: logistic regression with backward elimination based on SBC, logistic regression with stepwise selection based on AIC, and a decision tree using default settings. The models were compared on their misclassification rates when applied to the validation data:
Model                        Validation misclassification
Logistic backwards (SBC)     0.1615
Logistic stepwise (AIC)      0.1614
Decision tree                0.1645
The AIC-based stepwise logistic model had the best performance and the Derive predicted option was used to add the predicted probabilities to the Data pane.
Stage-2 models
All of the stage-2 models used the predicted probability from the final stage-1 model as an input. How should we model the interval target? A linear regression model would be inappropriate for these data because of the high skewness of the target, Int_tgt. We decided to try two generalized linear models: lognormal and gamma regressions. Both use continuous distributions that can account for skewness in the response. Below is a picture of the skewed lognormal and gamma distributions compared to a (truncated) normal distribution. Note that the lognormal distribution has heavier tails than the gamma and is more skewed than a truncated normal distribution:
Image modified from SAS Statistics 2: ANOVA and regression course notes
Another advantage of lognormal and gamma regression models is that these distributions are strictly positive, so, unlike a linear regression model, they will never produce a nonsensical prediction of a negative-valued purchase. The lognormal model and the gamma model are available using a Generalized Linear Model object and choosing, respectively, the normal distribution with a log link and the gamma distribution with a log link from the options pane:
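For code-oriented readers, a hedged sketch of a roughly equivalent model uses the SAS Viya procedure GENSELECT; the CAS table name and the stage-1 prediction name (p_B_tgt1) are placeholders for whatever the Derive predicted step produced:
/* Hedged sketch: gamma regression with a log link (swap in distribution=normal
   for the lognormal-style model built in the Visual Statistics object). */
proc genselect data=mycas.vs_bank;
   class Cat_input1 Cat_input2;
   model Int_tgt = p_B_tgt1 Logi_RFM1-Logi_RFM12 Cat_input1 Cat_input2 /
         distribution=gamma link=log;
run;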
How do we get predictions for the 80% of customers who did not make a purchase? The $0 purchasers were coded as missing values for the target. When a model is built in SAS on data that include missing values of the target, the model is fit on only the rows with complete data (200K rows here), but predictions are made for the complete data set (1.1 M rows). Sometimes a researcher will use this to get predictions from a model by concatenating a data set to be scored (missing the target) with the data used for fitting the model and then calculating the predictions. This is sometimes called the missing-Y trick.
It’s likely the predicted probability from the stage-1 model is non-linearly associated with the interval target. How should we account for this non-linear relationship? One way is to use a spline function. A spline is a piecewise polynomial function whose pieces are smoothly joined at knots. Splines are very flexible and can model complex non-linear relationships that are difficult to capture with polynomials or interactions. The spline effects are typically not interpretable, but the non-spline terms in a regression model retain their interpretability. To add a spline to the lognormal and gamma models, we used a Generalized Additive Model (GAM) object and created a spline of the predicted probability of B_tgt using the New data item menu and choosing Spline effect:
We let SAS Visual Statistics do the hard work of iteratively finding a spline function for our generalized additive model using the default settings. For more information on creating generalized additive models in SAS programmatically or using SAS Model Studio, see Beth Ebersol’s post here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-Use-Generalized-Additive-Models-in-SAS-Viya/ta-p/895086.
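If you prefer code over the visual interface, a rough programmatic analogue of the stage-2 gamma GAM could be sketched with PROC GAMPL. The data set and variable names below are placeholders (x1 and x2 stand in for the other model effects, and p_b_tgt for the stage-1 predicted probability):
proc gampl data=work.purchasers;
   /* parametric effects plus a spline of the stage-1 predicted probability */
   model Int_tgt = param(x1 x2) spline(p_b_tgt) / dist=gamma;
run;
PROC GAMPL estimates the spline smoothing parameters automatically, much like the default behavior described above.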
The lognormal and gamma models used backwards elimination with variables removed based on the validation average squared error. The GAM models used the same predictors as the corresponding generalized linear models except the stage-1 predictions were removed from the Continuous effects role and a spline function of these stage-1 predictions was added. We also tried a decision tree as a stage-2 model. The default decision tree settings were used, and the validation Average Squared Error was compared for each model:
Model                                                      Validation ASE
Generalized linear model: normal distribution, log link    55,646,643
GAM: normal distribution, log link                         55,539,930
Generalized linear model: gamma distribution, log link     56,074,933
GAM: gamma distribution, log link                          60,839,972
Decision Tree (default settings)                           51,307,883
The decision tree performed the best of the five models tested. The predictions from the decision tree model were saved using the Derive predicted option. We then created a list table with the customer account IDs, the predictions from the stage-1 logistic regression, the predictions from the stage-2 decision tree, and the expected future profit (the product of the predictions). Finally, we sorted the table by expected future profit:
To use the two-stage model with new data, the Export model option could be used for both models to produce SAS DATA step scoring code, which could then be run in the SAS Studio interface.
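As a sketch of what that scoring step might look like in SAS Studio, the exported score code could be pulled into a DATA step with %include. The file paths and the names of the prediction variables created by the score code are hypothetical here:
data work.scored_customers;
   set work.new_customers;                     /* data to score (placeholder name)            */
   %include '/home/student/stage1_score.sas';  /* stage-1 logistic score code (hypothetical)  */
   %include '/home/student/stage2_score.sas';  /* stage-2 decision tree score code            */
   expected_profit = p_b_tgt * p_int_tgt;      /* product of the two predictions (example names) */
run;
proc sort data=work.scored_customers;
   by descending expected_profit;
run;
The stage-1 score code must run before the stage-2 code because the decision tree uses the stage-1 predicted probability as an input.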
Summary and more information
This was a brief example of two-stage modeling using SAS Visual Statistics. Would you like to know more? Generalized linear models are a great tool to expand your modeling skills beyond linear regression. These models are discussed in the course Statistics 2: ANOVA and Regression and the SAS Viya course SAS Visual Statistics: Interactive Modeling Building. Generalized Additive Models are addressed in the course Regression Methods Using SAS Viya. Hope this post has given you some new ideas about how to model your data.
See you at the next SAS class!
Find more articles from SAS Global Enablement and Learning here.
01-17-2024
06:42 PM
Do you frequently use binary logistic regression? If you do, you may have encountered a problem called quasi-complete separation. This problem is often associated with categorical variables with many levels and it can cause the parameter estimates and p-values for your model to be untrustworthy. In this post, I’ll explain the problem and my favorite approach to dealing with it: converting the categorical predictor to a continuous predictor using smoothed weight of evidence coding.
So, what is quasi-complete separation? Before describing quasi-complete separation, it may be helpful to understand complete separation in logistic regression. For this example, I’m using the AmesHousing3 data set from the SAS course Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression (these data are a modified subset of the data from De Cock 2011). In these data, each observation is a house that was sold in Ames, Iowa between the years 2006–2010. The predictor variables in these data are descriptors of the houses such as the basement area in square feet (Basement_Area) and the number of bedrooms (Bedroom_AbvGr). The response variable Bonus is a binary variable with the value 1 indicating that the realtors selling houses were eligible to receive a bonus for the sale and 0 indicating they were not.
Using PROC LOGISTIC to fit the model for Bonus with the predictors Basement Area and SalePrice of the houses results in the following warning in the log and convergence status message in the results:
proc logistic data=STAT1.AmesHousing3;
model Bonus(event='1')= Basement_Area SalePrice;
run;
WARNING: There is a complete separation of data points. The maximum likelihood estimate does not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.
This is accompanied by unusual p-values for the overall model. Assuming alpha=0.05, the likelihood ratio p-value below says the model is highly significant (p<0.0001), but the Wald chi-square p-value (p=0.0504) is at least 500 times larger than the likelihood ratio p-value.
Have you ever seen this complete separation problem come up in your own logistic regression analyses? Possibly not, because this is actually a rare problem. The related problem “quasi-complete separation” is much more common. With complete separation, SAS is telling us that the maximum likelihood estimate does not exist. What caused this?
Well, I caused it. I created the complete separation problem by running a binary logistic regression with a “perfect” predictor. In the AmesHousing3 data set, Bonus=1 when the price of a house (SalePrice) is greater than $175,000; otherwise, Bonus=0. So SalePrice completely separates the observations into the two possible outcomes for the response: bonus eligible and bonus ineligible. Complete separation means that some linear combination of the explanatory variables perfectly predicts the dependent variable. When this happens, the maximum likelihood algorithm does not converge and the results of the model cannot be trusted.
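One quick way to see the separation is to compare the range of SalePrice within each Bonus group; because Bonus is defined by a SalePrice cutoff, the two ranges do not overlap. A minimal sketch:
proc means data=STAT1.AmesHousing3 n min max;
   class Bonus;          /* separate statistics for bonus eligible and ineligible houses */
   var SalePrice;
run;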
Quasi-complete separation
Quasi-complete separation is very similar to complete separation. When a continuous predictor nearly separates the binary response into the two outcome categories, it can produce quasi-complete separation. But the more common cause is a categorical predictor with one or more levels containing all events or all non-events. Let’s see an example of this.
Now I’m changing my logistic regression model to have Bedroom_AbvGr (bedrooms above ground) as the sole predictor. This categorical variable has 5 levels, corresponding to 0–4 bedrooms.
proc logistic data=STAT1.ameshousing3;
class Bedroom_AbvGr/param=ref ref=first;
model Bonus(event='1')= Bedroom_AbvGr;
run;
Here is the warning from the log and the model status message in the results:
WARNING: There is possibly a quasi-complete separation of data points. The maximum likelihood estimate may not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.
Partial output:
Bedrooms is not a significant predictor, but with the quasi-complete separation warning, the parameter estimates, odds ratios, and p-values can’t be trusted. Even before we ran the model, we could tell that there would be a problem by looking at a cross-tabulation of Bonus and Bedrooms:
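A cross-tabulation like this can be produced with a PROC FREQ step along these lines:
proc freq data=STAT1.AmesHousing3;
   tables Bedroom_AbvGr*Bonus / norow nocol nopercent;   /* raw counts only */
run;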
Of the houses with 0 or 4 bedrooms above ground, none were bonus eligible. This is problematic for logistic regression because the logit for each level is calculated as the natural logarithm of (# bonus eligible houses / # bonus ineligible houses), and the natural log of zero is undefined. This could make the parameter estimates invalid for the levels with zero events (i.e., no bonus eligible houses). Even worse, if one of these levels is the reference level for the parameter estimates (the last level, 4, is the default, but I specified REF=FIRST, making level 0 the reference), then the parameter estimates for all levels will be affected. The same problem would occur if a level had no bonus ineligible houses, because calculating the logit would then involve dividing by zero, which is also undefined.
Are the levels with zero events really perfect predictors? No. A small sample size in a level can result in zero events or non-events just by chance. Perhaps with a larger sample of, say, 4-bedroom houses, there would be some bonus eligible houses. This quasi-complete separation problem tends to come up with small sample sizes or when categorical predictors have many levels, since there are more chances to have zero events or non-events. How can we avoid this problem?
One way to remove the quasi-complete separation problem from your logistic regression is to convert the categorical predictor into a continuous predictor. The continuous predictor can be calculated using smoothed weight of evidence (SWOE) coding. SWOE is similar to averaging the log odds in a particular level with the log odds of the overall sample. Let’s see the formula for calculating smoothed weight of evidence.
If we replaced the levels of bedrooms with the natural log of the odds of being bonus eligible (i.e., the logit), we would be using “weight of evidence” (WOE) coding. For level i of the predictor:
WOE_i = log(# bonus eligible houses in level i / # bonus ineligible houses in level i)
WOE coding replaces each level of the categorical predictor with the logit, a continuous variable. But the logit is undefined when a level has zero events or zero non-events, so we need to modify WOE to get rid of zeros in the numerator and denominator. One approach is to add a small number (say, 0.5) to both the numerator and the denominator. Another approach is to calculate smoothed weight of evidence (SWOE), which matches the calculation in the code below:
SWOE_i = log((events_i + c*ρ1) / (non-events_i + c*(1 - ρ1)))
where ρ1 is the proportion of events in the entire sample and c is a smoothing parameter. Since ρ1/(1 - ρ1) is the odds of the event for the whole sample, the equation for SWOE is like a weighted average of the logit for each level and the logit for the overall sample. Higher values of c reduce the variability of the logits for each level and make them more similar to the logit for the overall sample.
What’s a good choice for the smoothing parameter c? This is an empirical question that depends on your specific data set. Several values of c can be tested to see which produces the SWOE that has the strongest relationship with the response variable. Ideally, c could be found using a hold-out validation data set to reduce overfitting.
I tried a few values of the smoothing parameter (1, 5, 10, 15, 25) and found that c=1 produced the best SWOE. Bear in mind that AmesHousing3 is a small data set (n=300) and I did not use a hold-out validation data set to find c. Here’s the code I used to calculate SWOE for each level. The PROC MEANS step creates a data set (Counts) that has the number of bonus eligible houses (events) and the total number of houses (_FREQ_) for each level of Bedrooms. In the DATA step, I calculate SWOE (bedrooms_swoe) using ρ1=0.176 and the smoothing parameter c=1.
proc means data=STAT1.AmesHousing3 sum nway;
class Bedroom_AbvGr;
var Bonus;
output out=counts sum=events;
run;
%let rho1= 0.176; *45 bonus eligible/255 bonus ineligible houses;
%let c=1;
data counts;
set counts;
nonevents=_FREQ_-events;
bedrooms_swoe=log((events+&c*&rho1)/(nonevents+&c*(1-&rho1)));
run;
proc print data=counts noobs;
var Bedroom_AbvGr events nonevents bedrooms_swoe;
run;
The PROC PRINT output:
Then to add SWOE to the data and fit the model for Bonus with the continuous predictor, I used the following code:
data AmesHousing3;
set Stat1.AmesHousing3;
if Bedroom_AbvGr=0 then bedrooms_swoe=-2.772588722;
if Bedroom_AbvGr=1 then bedrooms_swoe=-1.393311934;
if Bedroom_AbvGr=2 then bedrooms_swoe=-1.676328118;
if Bedroom_AbvGr=3 then bedrooms_swoe=-1.747823684;
if Bedroom_AbvGr=4 then bedrooms_swoe=-3.912023005;
run;
proc logistic data=AmesHousing3;
model Bonus(event='1')=bedrooms_swoe;
run;
Partial output:
The model converged and bedrooms_swoe has more reasonable parameter estimates, odds ratios, and p-values.
Not only does SWOE coding remove any quasi-complete separation problem, it also reduces the number of parameters being estimated in the model. In these models, 4 parameters are estimated for Bedroom_AbvGr but only 1 is estimated for bedrooms_swoe. What if instead of a 5-level categorical predictor, we had a 1,000-level categorical predictor? SWOE coding would convert it to a continuous predictor and still require only 1 parameter estimate instead of nearly 1,000.
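With many levels, hard-coding each SWOE value as in the DATA step above quickly becomes impractical. A merge-based sketch, reusing the Counts data set created earlier, attaches the SWOE values to the analysis data automatically:
proc sort data=counts;
   by Bedroom_AbvGr;
run;
proc sort data=Stat1.AmesHousing3 out=work.ames_sorted;
   by Bedroom_AbvGr;
run;
data work.ames_swoe;
   merge work.ames_sorted(in=inames)
         counts(keep=Bedroom_AbvGr bedrooms_swoe);
   by Bedroom_AbvGr;
   if inames;   /* keep only the original observations */
run;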
Are you interested in learning more about SWOE and other methods for dealing with quasi-complete separation? Then consider taking the SAS class Predictive Modeling Using Logistic Regression. Not only will you learn about other approaches to dealing with the quasi-complete separation problem, this class is great preparation for attaining the SAS Statistical Business Analyst credential.
References
De Cock, Dean. 2011. “Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project.” Journal of Statistics Education 19(3).
01-17-2024
06:25 PM
A question about model hierarchy came up recently while I was teaching the SAS class Predictive Modeling Using Logistic Regression. In class, we were using PROC LOGISTIC to investigate all 1.12 quadrillion possible statistical models given 50 effects by using the SELECTION=SCORE and BEST= options. This produced a list of models of varying sizes, from 1 effect to 50 effects, showing the single best model of each size ("best" meaning having the highest score chi-squared). The 3-effect model with the highest score chi-squared statistic was y = X1 + X2 + X3*X4, but we ignored it because it is not a hierarchically well-formulated model. Not everyone was familiar with model hierarchy and why it is important. So, in this post, I’ll explain why it’s best practice to maintain model hierarchy in your regression models.
What does it mean to maintain model hierarchy?
Maintaining model hierarchy means regression models with interactions (e.g., X1*X2) include the corresponding main effects (e.g., X1 and X2) even if they are non-significant. Model hierarchy also applies to polynomial models in that a model with higher order polynomial terms (e.g., X^3) includes all lower order terms (e.g., X^2 and X) even if these are nonsignificant. Models that maintain model hierarchy are sometimes referred to as “hierarchically well-formulated models”.
So model hierarchy is maintained in these models:
Y = A + B + C + D + C*D
Y = A + B + C + B*C
Y = X^3 + X^2 + X
Y = X^2 + X
but not these models:
Y = C*D
Y = A + B + B*C
Y = X^4
Y = X^3 + X^2
Why is it best practice to maintain model hierarchy?
There are several reasons to use hierarchically well-formulated models in your business or research. In brief, (1) they allow fine tuning of the fit of the model to your data, (2) omitting the lower order terms makes strong (often incorrect!) assumptions about relationships in your data, (3) they make the interaction effect inference invariant to coding, and (4) not maintaining hierarchy opens the door to absurdly complex models.
Maintaining hierarchy allows fine-tuning the fit of the model to your data
Often, we’re trying to fit a regression model to our data without having strong theory telling us what the model should be. We see a Y vs X scatter plot showing a straight pattern, so we include the variable X. If we see a bend in the pattern, we include X^2. But while X^2 captures the curvature, including X provides additional useful information about the shape of the response function. For example, in the picture below, we can see how changing the coefficient for X in y = X^2 + bX changes the shape of the X-Y relationship:
Modified from Wikipedia https://commons.wikimedia.org/wiki/File:Quadratic_equation_coefficients.png)
Keeping bX in the model provides greater flexibility to the model shape. So, including the lower order terms will likely allow a better fit of your models than leaving them out.
Omitting the lower order terms makes strong assumptions about relationships in your data
Let’s start off with a regression model with 2 predictors and their interaction:
y = b0 + b1X1 + b2X2 + b3X1X2 + e
When dealing with an interaction model such as the one above, we are often taught not to interpret the coefficients b1 and b2 when the interaction is significant. This is because a significant 2-way interaction means the effect of one predictor on the response depends on another predictor. So it doesn’t make sense to interpret either of the interacting predictors alone. While this is true, the coefficients b1 and b2 do have meaning in this interaction model, and it is not the average effect of the predictors. The meanings of the coefficients are:
b1 = the effect of a unit increase in X1 when X2 = 0
b2 = the effect of a unit increase in X2 when X1 = 0.
[Note: the above interpretation only makes sense when X1 and X2 have true zero points (i.e., are on a ratio scale), as in measures of size or distance. Interval-scaled continuous variables have arbitrary zero points, so it would not be meaningful to interpret the effect of one variable while another variable was set to its arbitrary zero point.]
Leaving X1 and X2 out of your model is equivalent to insisting that b1 and b2 both equal zero. This is a strong assumption: that X1 has zero effect on Y when X2 equals zero (yet X1 is still important to the interaction), and likewise for X2. Even if an effect is not statistically different from zero, if it is not exactly zero, leaving it out will bias your other parameter estimates. Remember that a non-significant statistical test does not mean that the null hypothesis is true. So, to sum up, this assumption shouldn’t be made without a justification for it.
When is it justified? I found it difficult to think of an example when this would be reasonable but here goes: you know those epoxy adhesives that come in two tubes? When the two liquids are combined, they form a strong epoxy adhesive that will glue practically anything to anything. A model of adhesive strength (y) that included the amount of epoxy components (X 1 and X 2 ) could have zero coefficients of the main effects (since they are inert alone) but still have a strong interaction effect. Without strong justification, you’re better off maintaining model hierarchy and keeping the main effects in your model.
Keeping lower order terms makes interaction effect inferences invariant to coding
Model hierarchy is desirable because models that are hierarchically well formulated have inferences that are invariant to the coding that you choose for your predictors. One common way to recode predictors is to center them. Centering means subtracting the average from each value so that the transformed variable has a mean of zero. There are a variety of reasons to center data including reducing collinearity when interactions are present and making the intercept term more meaningful. Let’s see how centering affects inferences for models that do and don’t maintain model hierarchy.
For this example, I’ll use the data set sashelp.cars. I’ll use PROC STDIZE to create centered versions of the predictors Cylinders and MPG_City and save them to an output data set called centered. Then I’ll create 4 regression models, each using the response variable MSRP. The predictors for the first model are Cylinders, MPG_City, and their interaction. The second model is a version of the first model that uses centered variables. Model three is not a hierarchically well-formulated model—it only has the Cylinders * MPG_City interaction and is missing main effects. The fourth model is the same as model three, except that centered variables are used.
The code used is shown below. What happens to the coefficient of determination (R^2) and the p-values for the interactions when the variables are centered?
*center the predictors and keep the original variables;
proc stdize data=sashelp.cars out=centered (keep=msrp Cylinders MPG_City centered:)
sprefix=centered_ oprefix;
Var Cylinders MPG_City;
run;
*hierarchically well-formulated model;
proc glm data=centered;
model msrp=cylinders|mpg_city;
run;
*hierarchically well-formulated model (centered variables);
proc glm data=centered;
model msrp=centered_cylinders|centered_mpg_city;
run;
*no model hierarchy;
proc glm data=centered;
model msrp=cylinders*mpg_city;
run;
*no model hierarchy (centered variables);
proc glm data=centered;
model msrp=centered_cylinders*centered_mpg_city;
run;
Here is the output from models 1 through 4. Focus your attention on the R^2 and the interaction p-values from the Type 3 SS tables:
(Output for Model 1, Model 2, Model 3, and Model 4)
Centering the variables does not affect the interaction p-value or the model R^2 for the hierarchically well-formulated model. But in the interaction models without main effects, centering the predictors changes the R^2 from 0.07 to 0.005 and changes the interaction p-value from p=0.01 to p=0.14. This is not a desirable outcome!
We can see why omitting constituent terms from an interaction is problematic if we can tolerate a little algebra. Let’s start with the model y = b0 + b1X1 + b2X2 + b3X1X2, then add constant c to X1 and constant d to X2:
y = b0 + b1(X1 + c) + b2(X2 + d) + b3(X1 + c)(X2 + d)
  = b0 + b1X1 + b1c + b2X2 + b2d + b3X1X2 + b3dX1 + b3cX2 + b3cd
Next, collect all the constant terms (everything that does not involve an X), then all the functions of X1, functions of X2, then functions of X1X2:
  = b0 + b1c + b3cd + b2d + b1X1 + b3dX1 + b2X2 + b3cX2 + b3X1X2
and simplify:
  = (b0 + b1c + b3cd + b2d) + (b1 + b3d)X1 + (b2 + b3c)X2 + (b3)X1X2
The sums in parentheses are an intercept, the coefficient for X1, the coefficient for X2, and the coefficient for the X1X2 interaction using the transformed variables. We can fit this model the same way we fit the original model y = b0 + b1X1 + b2X2 + b3X1X2, since it has 2 main effects and their interaction. Compare this to the model y = b3X1X2 (with no main effects) when we add the constants c and d to X1 and X2, respectively:
y = b3(X1 + c)(X2 + d)
  = (b3cd) + (b3d)X1 + (b3c)X2 + (b3)X1X2
The parentheses were added in the last step above to make it easier to see that this is a regression with an intercept, a main effect of X1, a main effect of X2, and an X1X2 interaction effect. But we can’t estimate this model if we try to fit y = b3X1X2 because it has no main effects. Adding a constant in the interaction model without hierarchy creates linear effects of the predictors that aren’t captured by the model being fit. These main effects aren’t being estimated and will get lumped in with the interaction. As we’ve already seen, adding a constant will change the R^2 and p-values for the non-hierarchical model. We don’t want this!
The above explanation can be found in several papers such as Paul D. Allison’s (1977) “Testing for Interaction in Multiple Regression”. This is the same Allison who wrote my favorite books on logistic regression and survival analysis (Logistic Regression Using SAS: Theory and Application, Second Edition and Survival Analysis Using SAS: A Practical Guide, Second Edition). Check out Allison’s paper for a more thorough explanation of model hierarchy and invariance to coding.
Not maintaining model hierarchy opens the door to absurdly complex models
Let’s look back at the p-value for the model 1 interaction from the Type 3 sums of squares table. The p-value is 0.0119, and if we’re using alpha=0.05 as our significance threshold, this tells us that the interaction is statistically significant. But remember what Type 3 (also known as partial) sums of squares show: the additional sums of squares explained by an effect when it is added to a model containing all the other model effects. In other words, the p-value tells us whether the effect is significant after explaining as much as possible using all the remaining variables. So, the significant Cylinders*MPG_City interaction is telling us that after accounting for the main effects, the interaction is still important.
What if we leave out the main effects, as in model 3? The significance of the interaction is no longer assessed after accounting for main effects. If model 3 is all we look at, we miss knowing whether a simpler, main effects only model could have explained the data.
Researchers normally follow the principle of parsimony or Occam’s Razor, which basically says to use the simplest model that gets the job done. There’s no sense in using the model msrp = MPG_City^5, even though it is statistically significant (p=0.029), when the hierarchically well-formulated quadratic model msrp = MPG_City + MPG_City^2 (p<0.001) gets the job done and results in all higher order terms being nonsignificant. If we use a 5th-degree polynomial without the lower order terms, why not use y = X^17? The model y = X^17 is needlessly, even absurdly complex when a much simpler model does the job. The same logic applies to modeling an interaction without the main effects.
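For instance, the hierarchically well-formulated quadratic model mentioned above can be fit directly in PROC GLM by writing the quadratic term as a crossed effect (a quick sketch):
proc glm data=sashelp.cars;
   model msrp = mpg_city mpg_city*mpg_city;   /* linear plus quadratic term */
run;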
Now you know why to use hierarchically well-formulated models. If what you read here was intriguing, there are several places to go for more information. For learning effective model selection approaches with PROC LOGISTIC, check out the course Predictive Modeling Using Logistic Regression. For a deeper understanding of the do’s and don’ts of statistical inferences in regression modeling try the course Statistics 2: ANOVA and Regression. For a foundational course on many of the same topics, try Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression. If you are interested in a foundational statistics course geared towards predictive modeling and machine learning, try out Statistics You Need to Know for Machine Learning. Live Instructor Dates for these classes, and more, can be found by using the links provided.
See you in the next SAS class!
References
Allison, Paul D. 1977. “Testing for Interaction in Multiple Regression.” American Journal of Sociology 83(1): 144–153.
Allison, Paul D. 2010. Survival Analysis Using SAS: A Practical Guide, Second Edition. Cary, NC: SAS Institute Inc.
Allison, Paul D. 2012. Logistic Regression Using SAS: Theory and Application, Second Edition. Cary, NC: SAS Institute Inc.
Find more articles from SAS Global Enablement and Learning here.
01-17-2024
05:49 PM
Most predictive modelers are familiar with data splitting and honest assessment. When developing a predictive model, data is typically split into training, validation, and test sets for developing and assessing models. Despite having knowledge of partitioning, many modelers make the mistake of imputing in a manner that allows data to leak from the validation and test sets into the training data. In this post, I’ll describe the problem of data leakage, explain how to impute to avoid leakage, and give advice for using SAS tools that make imputation easy to accomplish as well as some options to avoid when preparing data and modeling.
Data partitioning
First, let’s review data partitioning. The goal of predictive modeling is generalization, that is, to make predictions or score new data. So predictive modelers need to build models that function well at making predictions on new data sets. Data splitting is what best allows one to evaluate a model’s ability to accomplish this goal. When developing a predictive model, the researcher will typically split their data into a set for training the model to make predictions (training), a data set for choosing a model out of the candidates (validation), and a data set for final unbiased assessment of the champion model’s performance (test). Without data splitting, we can only assess how well the model predicts the historic data used to train it. When we assess a model on the training data, we get an optimistically biased assessment of performance: the model will have been chosen because it does well on the particular data set used for training, which may have features not shared by data to be scored later. That’s not what we want. We want to know how well we can expect the model to perform when applied to previously unseen data. Data splitting allows honest assessment, that is, an unbiased assessment of the chosen model’s performance. It allows the modeler to simulate applying a model built today to new data seen tomorrow. For this reason, the modeler must ensure that any information in the validation and test sets is excluded from any type of fitting and pre-modeling processing steps such as imputation and transformations.
For a review of data partitioning using SAS Visual Analytics/Visual Statistics, see Beth Ebersol’s blog [1]. For steps on partitioning data using DATA step code and PROC SURVEYSELECT, see Rick Wicklin’s blog [2].
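For example, a simple 70/30 split could be sketched as follows (work.mydata is a placeholder for your analysis table):
proc surveyselect data=work.mydata out=work.split
                  samprate=0.7 seed=802 outall;   /* OUTALL adds a Selected flag to every row */
run;
data work.training work.validation;
   set work.split;
   if Selected = 1 then output work.training;
   else output work.validation;
run;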
In some cases, researchers use only training and validation sets, omitting the test set. When this occurs, it is with the understanding that the validation data model assessment measures should be considered the upper bounds of model performance. For simplicity, I’ll consider data that has been partitioned into only training and validation for the rest of this post. I’ll also assume mean imputation is the chosen imputation approach.
Data leakage
Data leakage, also called information leakage, occurs when information from the validation data contaminates the training data. This information that is introduced during training would not be accessible in a real-world application of the predictive model. This is different than a related issue, sometimes called target-leakage, in which information about the target is used as a predictor. Data leakage can lead to optimistically biased assessments of model performance, possibly making unacceptable model performance appear satisfactory. If any information from the validation data is used during training, the model gains an unfair advantage that won’t exist when it is deployed in a real-world scenario. To avoid data leakage, all data preprocessing including imputation of missing values is done solely using the training data. This simulates as closely as possible the conditions under which the model will be used in practice.
So how does leakage actually occur? Often a researcher will impute missing values prior to partitioning their data. This seems reasonable at first because imputing after partitioning requires imputing twice instead of once. But this results in leakage, with the model being trained on data that contains information that the model should not be able to see. It is effectively cheating, showing the model some of the answers it is trying to predict. Kapoor and Narayanan [3] show that several kinds of data leakage are surprisingly common in published literature across several scientific disciplines.
The best practice for imputing to avoid leakage is as follows. First, split the data into training and validation sets. For mean imputation, find the training data means, then impute the training means for missing values in both the training and validation sets.
For example, suppose a data set of 1 million observations was partitioned 50/50 into training and validation sets, and the means of a variable were:
Data set                  Variable mean
Whole data (n = 1M)       75
Training (n = 500K)       80
Validation (n = 500K)     70
we would use 80 to replace missing values in both the training data and the validation data. When researchers impute missing values before partitioning, they are using 75 for the whole data set. This means that the model has access to information in the training data that it would not have in practice. The model is cheating by seeing information about data that should be unseen.
After partitioning then imputation, the training and validation sets may need to be recombined. Many SAS Viya statistical and machine learning modeling procedures will compute fit statistics on all data partitions if contained within the same data set.
How to impute while avoiding leakage
There are several ways to impute in SAS, and the two methods that I use most are shown below. PROC VARIMPUTE is a nice procedure for imputation using in-memory data in SAS Viya. If you’re using SAS 9, PROC STDIZE will do the job. Both allow you to impute the same values on the validation data that were used for imputing the training data.
PROC VARIMPUTE can be run easily through the point-and-click SAS Studio Imputation task, located under the SAS Viya Prepare and Explore Data tasks. After identifying the variables to impute and the data sets involved, the task-generated code will need to be edited to include a CODE statement. The CODE statement produces SAS score code that can impute the training means onto other data, such as the validation or test sets; this generated code can then be referenced in a DATA step using %include. PROC VARIMPUTE can also be used to impute medians, random numbers between the minimum and maximum values, or custom values.
proc varimpute data=mycas.training;
input var1 var2 var3 / ctech=mean;
output out=mycas.training_imputed copyvars=(_all_);
code file='/home/student/imputed_vars.sas';
run;
data mycas.validation_imputed;
set mycas.validation;
%include '/home/student/imputed_vars.sas';
run;
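If a single table containing both partitions is needed, for example so a procedure can report fit statistics for each partition, the imputed partitions can be concatenated with an indicator variable. A minimal sketch, reusing the imputed tables created above (the _partind_ name is just an example):
data mycas.model_ready;
   set mycas.training_imputed(in=intrain)
       mycas.validation_imputed;
   _partind_ = intrain;   /* 1 = training observation, 0 = validation */
run;
The indicator can then be supplied to a procedure's PARTITION statement, as described below.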
PROC STDIZE has an OUTSTAT= option that produces a data set containing location and scale measures as well as other computed statistics. The REPONLY option in the code below replaces missing values with the variable means without altering nonmissing values. A second PROC STDIZE step is then run with the METHOD=IN() option to replace missing validation data values with the training data means. PROC STDIZE has many additional options for transforming and imputing values.
proc stdize data=work.training method=mean reponly
            out=work.training_imputed outstat=training_means;
var var1 var2 var3;
run;
proc stdize data=work.validation reponly method=in(training_means)
out=work.validation_imputed;
var var1 var2 var3;
run;
Further recommendations
Many of the SAS Viya regression procedures have an informative missingness option that can automatically apply mean imputation for missing values. These procedures include GENSELECT, LOGSELECT, PHSELECT, QTRSELECT, and REGSELECT and this option can be applied by specifying the INFORMATIVE option in their MODEL statements. Applying the informative missingness option does 3 things. First, it imputes the mean for missing continuous variables. Second, it creates missing indicator variables which can potentially capture the relationship between missingness and the target. Third, it treats missing values for categorical predictors as a legitimate level for analysis. These procedures also have PARTITION statements, which allow random partitioning by the procedure or specification of a partition indicator variable if one is already present in the data. Using PARTITION results in fit statistics being calculated on both training and validation sets as well as allowing a champion model to be chosen automatically based on validation data performance.
I recommend not using the PARTITION statement and INFORMATIVE option with the same PROC step for these procedures. The informative missingness option will use the whole data set means for imputation, then randomly partition the data into training and validation sets. This leads to leakage as explained earlier. Instead, use the informative missingness option when working only with training data, not a data set that contains all partitions. When I use the PARTITION statement to identify partitions and get fit statistics on each partition, it is with data that has already been imputed, using the training means for the whole data set.
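As a sketch of that recommendation, the INFORMATIVE option would be applied to the training table only; the table and variable names here are placeholders, and selection settings are omitted for brevity:
proc logselect data=mycas.training;
   class region;                                          /* categorical predictor (placeholder)      */
   model purchased(event='1') = region income age / informative;   /* informative missingness option  */
run;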
In summary, for predictive modeling, we don’t want the unconditional variable means imputed for the whole data set. Instead, we want the training data means imputed for the whole data set. Imputation needs to be done after partitioning, not on the unpartitioned data. Also, in several SAS Viya regression procedures, it is best to avoid using the PARTITION statement and the INFORMATIVE option together. The INFORMATIVE option makes sense if modeling the training data only.
Learning more
Are you interested in learning more about data preparation for predictive modeling including partitioning and imputation? Then consider taking the SAS 9 class Predictive Modeling Using Logistic Regression, which covers all the topics mentioned here (and many more!) in more detail. Not only will you learn about the care needed in data preparation, this class is great preparation for attaining the SAS Statistical Business Analyst credential (https://www.sas.com/en_us/certification/credentials/advanced-analytics/statistical-business-analyst.html). If you’re modeling using SAS Viya, Supervised Machine Learning Procedures Using SAS Viya in SAS Studio is a great option too.
See you in the next SAS class!
References
[1] Ebersol, Beth. 2019. “Training, Validation, and Testing for Supervised Machine Learning Models.” SAS Communities Library.
[2] Wicklin, Rick. 2019. “Create training, validation, and test data sets in SAS.” The DO Loop.
[3] Kapoor, Sayash, and Arvind Narayanan. 2022. “Leakage and the Reproducibility Crisis in ML-based Science.” https://doi.org/10.48550/arXiv.2207.07048
Find more articles from SAS Global Enablement and Learning here.