In SAS Enterprise Miner, the Variable Selection node offers an R-square method. I checked this document, which gives a good explanation of how the R-square method works. What I am wondering is how R-square is calculated in the first step. Is a regression of the target variable run on each input separately, so that each input variable gets its own R-square?
Some variable selection algorithms (often known as "stepwise") go like this:
Step 1: compute the R-squared for each of the x variables on its own, then select the variable with the highest R-squared to be the first variable included in the model. (For example, let's say X7 has the highest R-squared; the model is now Y = X7)
Step 2: compute the R-squared for all of the possible models with X7 and ONE other variable. Pick the variable that gives the highest R-squared to be the second variable included in the model. (For example, let's say X2 gives the highest R-squared in this step; the model is now Y = X7 X2)
Continue until the increase in R-squared is less than some pre-specified threshold, or until the variable added isn't statistically significant, or ... there are all sorts of variations of this algorithm.
NOTE: this algorithm is not guaranteed to find, among all possible models with k terms, the one that has the highest R-squared.
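To make the steps above concrete, here is a minimal Python sketch of forward selection by R-squared. It is only an illustration of the generic algorithm, not what Enterprise Miner actually runs. The input names X1 to X4, the simulated data, and the `min_gain` stopping threshold are all made up for the example.

```python
import random

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.
    Assumes the system is non-singular (no perfectly collinear inputs)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def r_squared(columns, y):
    """R-squared of a least squares fit of y on the columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * y[i] for i, row in enumerate(design)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst

def forward_select(inputs, y, min_gain=0.01):
    """Add one input per step, always the one giving the highest model
    R-squared; stop when the gain drops below min_gain (hypothetical rule)."""
    selected, history, current = [], [], 0.0
    remaining = list(inputs)
    while remaining:
        scores = {
            name: r_squared([inputs[s] for s in selected] + [inputs[name]], y)
            for name in remaining
        }
        best = max(scores, key=scores.get)
        if scores[best] - current < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current = scores[best]
        history.append((best, current))
    return history

# Simulated data: y depends strongly on X3, weakly on X1, not on X2 or X4.
random.seed(0)
n = 200
inputs = {f"X{j}": [random.gauss(0, 1) for _ in range(n)] for j in range(1, 5)}
y = [3 * inputs["X3"][i] + inputs["X1"][i] + random.gauss(0, 0.5)
     for i in range(n)]

history = forward_select(inputs, y)
for name, r2 in history:
    print(f"added {name}, model R-squared = {r2:.3f}")
```

Note that step 1 here is exactly a separate one-input regression per candidate, and each later step refits the model with the already selected inputs plus one candidate.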
Hello @ycenycute ,
R-square(d) measures the strength of the relationship between your model (your input / independent variables & the functional form of the model) and the dependent variable on a convenient 0 – 100% scale.
It measures how much of the total variance in your dependent variable is explained by the model, ... the more, the better of course.
R-Squared is ubiquitous in statistics, which is also why people tend to use it uncritically (a high R-Squared is not always good news).
The main disadvantage of R-Squared is that it never decreases when you add an additional input to your model, and in practice it almost always increases (even if that input is not significantly contributing to the power of the model, but is only explaining a bit of noise).
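As a small illustration of that last point, the Python sketch below adds a pure-noise input to a one-input regression and compares the two R-squared values. The data are simulated and all names (`x`, `noise`, `residuals`) are made up for the example. It uses the Frisch-Waugh result that adding an input reduces the error sum of squares by a squared quantity, so the reduction cannot be negative.

```python
import random

def residuals(v, x):
    """Residuals from a simple least squares regression of v on x (with intercept)."""
    n = len(v)
    mv, mx = sum(v) / n, sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (vi - mv) for xi, vi in zip(x, v)) / sxx
    return [(vi - mv) - slope * (xi - mx) for xi, vi in zip(x, v)]

random.seed(1)
n = 100
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 * xi + random.gauss(0, 1) for xi in x]
noise = [random.gauss(0, 1) for _ in range(n)]  # unrelated to y by construction

mean_y = sum(y) / n
sst = sum((yi - mean_y) ** 2 for yi in y)   # total sum of squares

e = residuals(y, x)       # what x leaves unexplained in y
z = residuals(noise, x)   # the part of the noise input not already captured by x

sse_one = sum(ei ** 2 for ei in e)          # error sum of squares for y ~ x
# Adding the noise input lowers the error sum of squares by (e.z)^2 / (z.z)
reduction = sum(ei * zi for ei, zi in zip(e, z)) ** 2 / sum(zi ** 2 for zi in z)

r2_one = 1 - sse_one / sst                  # model with x only
r2_two = 1 - (sse_one - reduction) / sst    # model with x and the noise input

print(f"R-squared with x only:      {r2_one:.4f}")
print(f"R-squared with x and noise: {r2_two:.4f}")
```

The second value can never be smaller than the first, even though the extra input carries no information about y.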
Anyway, how does the R-square selection method in the Variable Selection node of Enterprise Miner work?
Read it here:
SAS® Enterprise Miner™ 15.1: Reference Help
Variable Selection Node
https://go.documentation.sas.com/doc/en/emref/15.1/n1m7rvh6yyb3mmn0zavezsher4ml.htm
In short, in the Forward Stepwise Regression, ... at each successive step, an additional input variable is chosen that provides the largest incremental increase in the model R**2.
Forward Stepwise means you start with zero inputs in the model and then you add the one that provides the biggest R**2 in the simple model (the model with one input), then you add a 2nd variable (the one that provides the largest incremental increase in the model R**2) and so on ... until stopping criteria are met.
I suggest you come back to us with whatever is still unclear to you in the doc.
Kind regards,
Koen
