Solved: What does R square mean in variable selection?

ycenycute · Posted 09-18-2021 07:25 AM

In SAS Enterprise Miner, the Variable Selection node offers an R-square method. I checked this document, which gives a good explanation of how the R-square method works. I am wondering how R-square is calculated in the first step. Does it run a regression of the target variable on each input separately, and then take the R-square from each of those single-input regressions?

PaigeMiller · Posted 09-20-2021 11:27 AM

Some variable selection algorithms (often known as "stepwise") go like this:

Step 1: compute the R-squared for each x variable on its own (that is, fit one single-input regression per variable), then select the variable with the highest R-squared as the first variable in the model. (For example, if X7 has the highest R-squared, the model is now Y = X7.)

Step 2: compute the R-squared for every possible model containing X7 and ONE other variable. The variable whose model has the highest R-squared becomes the second variable in the model. (For example, if X2 gives the highest R-squared in this step, the model is now Y = X7 X2.)

Continue until the increase in R-squared is less than some pre-specified threshold, or until the variable added isn't statistically significant, or ... there are all sorts of variations of this algorithm.

NOTE: among all possible models with k terms, this algorithm is not guaranteed to find the one with the highest R-squared.
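The steps above can be sketched in a few lines of Python. This is only an illustration of the generic greedy procedure, not the code that Enterprise Miner runs. The function names `r_squared` and `forward_select` and the stopping threshold `min_gain=0.01` are made up for this example, and the data is simulated.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least-squares fit of y on X (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def forward_select(X, y, min_gain=0.01):
    """Greedy forward selection driven by the increase in R-squared.

    min_gain is an arbitrary illustrative threshold, not a SAS default.
    """
    selected, current_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # R-squared of the current model plus each candidate column in turn
        scores = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
        best = max(scores, key=scores.get)
        if scores[best] - current_r2 < min_gain:
            break  # the best candidate no longer improves R-squared enough
        selected.append(best)
        remaining.remove(best)
        current_r2 = scores[best]
    return selected, current_r2

# Simulated data: y depends strongly on column 2 and weakly on column 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
print(forward_select(X, y))
```

Because each step only looks one variable ahead, the k variables chosen this way need not be the best subset of size k, which is the point of the NOTE.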

--
Paige Miller


sbxkoenk · Posted 09-18-2021 10:30 AM

Hello @ycenycute ,

R-square(d) measures the strength of the relationship between your model (your input / independent variables & the functional form of the model) and the dependent variable on a convenient 0 – 100% scale.

It measures how much of the total variance in your dependent variable is explained by the model, ... the more, the better of course.
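To make "share of variance explained" concrete, here is a minimal sketch of the usual definition, R-squared = 1 - SS_residual / SS_total. The function name `r_squared_from_predictions` and the toy numbers are invented for illustration. The 0 to 100% range holds for a least-squares fit with an intercept, evaluated on the data it was fitted to.

```python
import numpy as np

def r_squared_from_predictions(y, y_hat):
    """R-squared = 1 - (residual sum of squares) / (total sum of squares)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = float(((y - y_hat) ** 2).sum())     # variation left unexplained
    ss_tot = float(((y - y.mean()) ** 2).sum())  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy example: predictions that track the observed values closely
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared_from_predictions(observed, predicted), 4))  # 0.98
```

Here the residual sum of squares is 0.10 and the total sum of squares is 5.0, so about 98% of the variation in the observed values is accounted for.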
R-squared is ubiquitous in statistics, but that is also why people tend to use it uncritically (a high R-squared is not always good news).

The main disadvantage of R-squared is that it never decreases (and in practice almost always increases) when you add an additional input to your model, even if that input does not contribute significantly to the power of the model and only explains a bit of noise.
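A small simulation illustrates this. It is a sketch on made-up data, and the helper name `ols_r2` is hypothetical. The second input is pure random noise with no connection to the target, yet the least-squares R-squared of the larger model cannot come out lower than that of the smaller one.

```python
import numpy as np

def ols_r2(X, y):
    """R-squared of a least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
noise_input = rng.normal(size=n)      # unrelated to y by construction
y = 2.0 * x + rng.normal(size=n)

r2_one = ols_r2(x.reshape(-1, 1), y)                    # model with x only
r2_two = ols_r2(np.column_stack([x, noise_input]), y)   # x plus the noise input

# The smaller model is nested in the larger one, so r2_two >= r2_one
print(r2_one, r2_two, r2_two >= r2_one)
```

This is why stepwise procedures look at the size of the increase, or at a significance test, rather than at whether R-squared went up at all.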

Anyway, how does the R-square selection method in the Variable Selection node of Enterprise Miner work?

Read it here:

SAS® Enterprise Miner™ 15.1: Reference Help

Variable Selection Node
https://go.documentation.sas.com/doc/en/emref/15.1/n1m7rvh6yyb3mmn0zavezsher4ml.htm

In short, in the Forward Stepwise Regression, ... at each successive step, an additional input variable is chosen that provides the largest incremental increase in the model R**2.

Forward Stepwise means you start with zero inputs in the model and then you add the one that provides the biggest R**2 in the simple model (the model with one input), then you add a 2nd variable (the one that provides the largest incremental increase in the model R**2) and so on ... until stopping criteria are met.

I suggest you come back to us with whatever is still unclear to you in that documentation.

Kind regards,
Koen

ycenycute · Posted 09-20-2021 04:54 AM

Thanks for the reply. I understand the meaning of R square. I was asking how R square is calculated in the first step. The manual describes 2 steps (3 for a binary target). So in the first step, is SAS running a linear regression of the output on each input separately, and then picking those inputs whose R square is above the threshold?
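The per-input calculation described in this question can be sketched as follows. Whether Enterprise Miner does exactly this is not confirmed in this thread, so treat it as an assumption. The sketch relies on one fact that holds in general: for a linear regression with a single input and an intercept, R-squared equals the squared Pearson correlation between that input and the target. The function names and the cutoff `min_r2=0.05` are arbitrary choices for illustration.

```python
import numpy as np

def single_input_r2(X, y):
    """R-squared of y regressed on each column of X alone.

    With one input plus an intercept, R-squared is the squared Pearson
    correlation, so no explicit regression fit is needed.
    """
    return np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2
                     for j in range(X.shape[1])])

def screen_inputs(X, y, min_r2=0.05):
    """Keep the inputs whose single-input R-squared reaches min_r2.

    min_r2 is an illustrative cutoff, not a documented SAS default.
    """
    r2 = single_input_r2(X, y)
    keep = [j for j in range(X.shape[1]) if r2[j] >= min_r2]
    return keep, r2

# Simulated data: only column 1 is related to the target
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 1] + rng.normal(size=300)
keep, r2 = screen_inputs(X, y)
print(keep, np.round(r2, 3))
```

Inputs that pass such a screen would then be the candidates for the forward stepwise stage.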

